zlib-ng PR #2291 — struct-local experiment for strstart/lookahead

Follow-up to PR #2291 (nv/develop/deflate-strategy-locals-hoist). After the scalar-local hoist landed, I tried two variants in deflate_quick.c to see whether packing the two 32-bit fields into a struct could coax the compiler into 64-bit load/store transfers.

Machine

  • Apple M5, macOS 26.4.1, Apple clang 21.0.0
  • AArch64 native, x86_64 cross-compile via CMAKE_OSX_ARCHITECTURES=x86_64
  • CMake Release, -D BUILD_SHARED_LIBS=OFF

Variants tried

Baseline (PR #2291 head): two scalar locals

unsigned int lookahead = s->lookahead;
unsigned int strstart = s->strstart;

Variant 1: local struct with brace initializer

struct { uint32_t strstart, lookahead; } e = { s->strstart, s->lookahead };

Variant 2: anonymous union in deflate_state, struct copy from s->hot

/* in deflate.h */
typedef struct deflate_hot_s {
    uint32_t lookahead;
    uint32_t strstart;
} deflate_hot;

struct internal_state {
    ...
    union {
        deflate_hot hot;
        struct {
            unsigned int  lookahead;
            unsigned int strstart;
        };
    };
    ...
};

/* in deflate_quick.c */
deflate_hot e = s->hot;
/* ... use e.lookahead, e.strstart ... */
s->hot = e;
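
The union in Variant 2 only works as an alias if both views place the two fields at the same offsets. A minimal sketch of that layout check follows, assuming a C11 compiler (for the anonymous struct) and a 32-bit unsigned int. The names demo_hot, demo_state and demo_read_hot are hypothetical stand-ins, not zlib-ng identifiers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the pair of hot fields. */
typedef struct demo_hot_s {
    uint32_t lookahead;
    uint32_t strstart;
} demo_hot;

/* Hypothetical stand-in for the relevant slice of deflate_state. */
typedef struct demo_state_s {
    union {
        demo_hot hot;
        struct {                      /* C11 anonymous struct */
            unsigned int lookahead;
            unsigned int strstart;
        };
    };
    unsigned int w_size;
} demo_state;

/* Both views must overlap exactly, or the struct copy moves the wrong bytes. */
static_assert(sizeof(unsigned int) == sizeof(uint32_t),
              "unsigned int must be 32 bits");
static_assert(offsetof(demo_state, lookahead) == offsetof(demo_state, hot.lookahead),
              "lookahead views must overlap");
static_assert(offsetof(demo_state, strstart) == offsetof(demo_state, hot.strstart),
              "strstart views must overlap");
static_assert(sizeof(demo_hot) == 8, "pair should fit one 64-bit transfer");

/* Write through the scalar view, read back through the struct view. */
demo_hot demo_read_hot(unsigned int lookahead, unsigned int strstart) {
    demo_state s;
    s.lookahead = lookahead;
    s.strstart = strstart;
    s.w_size = 0;
    return s.hot;
}
```

If any of the static assertions fails, the union is not a safe alias on that target and Variant 2 should not be used as written.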

Asm comparison

Function body of deflate_quick, both arches.

Variant                              AArch64 lines   x86_64 lines   x86_64 spills
Scalar locals (baseline)             417             457            18
Variant 1: brace-init local struct   417             457            18
Variant 2: union + s->hot copy       417             457            18

AArch64: byte-for-byte identical across all three variants. The optimizer already emits a single ldp for the adjacent 32-bit fields whether the source is two assignments or a struct copy.

ldp w23, w22, [x0, #0x40]   ; load lookahead + strstart in one instruction
...
stp w23, w22, [x19, #0x40]  ; store back in one instruction

x86_64: identical apart from a minor scheduling reorder of three movl in Variant 1 (already noted in the PR-thread post). All variants emit two movl per LOAD and per STORE. x86 has no ldp-equivalent for paired 32-bit GPR loads.

movl  0x40(%rdi), %r12d   ; lookahead
movl  0x44(%rdi), %r13d   ; strstart
...
movl  %r12d, 0x40(%rbx)
movl  %r13d, 0x44(%rbx)
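
To make the x86_64 trade-off concrete, here is a sketch of what a single 64-bit load of the pair would look like at the source level, next to the two-load form. It assumes a little-endian target. demo_pair, demo_load_wide and demo_load_narrow are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical pair laid out like the two adjacent fields. */
typedef struct demo_pair_s {
    uint32_t lookahead;   /* lower address */
    uint32_t strstart;    /* higher address */
} demo_pair;

static_assert(sizeof(demo_pair) == 8, "pair must be exactly 64 bits");

/* One 64-bit load, then split. Assumes a little-endian target. */
void demo_load_wide(const demo_pair *p, uint32_t *lookahead, uint32_t *strstart) {
    uint64_t both;
    memcpy(&both, p, sizeof both);         /* the single wide load */
    *lookahead = (uint32_t)both;           /* low half: truncate */
    *strstart = (uint32_t)(both >> 32);    /* high half: extra shift */
}

/* Two 32-bit loads: each value lands directly in its own register. */
void demo_load_narrow(const demo_pair *p, uint32_t *lookahead, uint32_t *strstart) {
    *lookahead = p->lookahead;
    *strstart = p->strstart;
}
```

The wide form saves one load but needs a shift to recover the upper field, which is the extra extract step that makes two movl the cheaper choice.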

Why none of the variants change codegen

LLVM's SROA (Scalar Replacement of Aggregates) pass decomposes small structs into their constituent scalars very early. By the time codegen runs, there's no struct left to copy as a unit — just independent scalar fields. The compiler then emits whatever the target ISA prefers for two independent same-size scalars:

  • The AArch64 backend coalesces two adjacent same-size scalar loads/stores into ldp/stp after SROA, so the source representation doesn't matter. The codegen is already optimal.
  • x86_64 has no instruction that loads two 32-bit values into two GPRs in one shot. movq would load both into one 64-bit register, requiring a subsequent extract — strictly more work. Two movl is already the right answer.
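
The three source forms can be reduced to a small standalone file for side-by-side inspection. This is a sketch only: demo_state, DEMO_MIN_LOOKAHEAD and the loop body are hypothetical and much simpler than deflate_quick, so the output will not match the line counts above.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_MIN_LOOKAHEAD 3u   /* hypothetical threshold, not a zlib-ng constant */

typedef struct demo_hot_s {
    uint32_t lookahead;
    uint32_t strstart;
} demo_hot;

typedef struct demo_state_s {
    union {
        demo_hot hot;
        struct {
            unsigned int lookahead;
            unsigned int strstart;
        };
    };
} demo_state;

/* Baseline: two scalar locals. */
void demo_scalar(demo_state *s) {
    unsigned int lookahead = s->lookahead;
    unsigned int strstart = s->strstart;
    while (lookahead >= DEMO_MIN_LOOKAHEAD) { strstart++; lookahead--; }
    s->lookahead = lookahead;
    s->strstart = strstart;
}

/* Variant 1: local struct with brace initializer. */
void demo_brace_init(demo_state *s) {
    struct { uint32_t strstart, lookahead; } e = { s->strstart, s->lookahead };
    while (e.lookahead >= DEMO_MIN_LOOKAHEAD) { e.strstart++; e.lookahead--; }
    s->strstart = e.strstart;
    s->lookahead = e.lookahead;
}

/* Variant 2: struct copy through the union member. */
void demo_union_copy(demo_state *s) {
    demo_hot e = s->hot;
    while (e.lookahead >= DEMO_MIN_LOOKAHEAD) { e.strstart++; e.lookahead--; }
    s->hot = e;
}
```

Compiling this with optimization and assembly output (for example cc -O2 -S) and diffing the three function bodies is a quick way to check whether a given compiler and target treat the forms differently.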

Diff that was tested

diff --git a/deflate.h b/deflate.h
index 3f9f8f46..bba5773a 100644
--- a/deflate.h
+++ b/deflate.h
@@ -120,6 +120,14 @@ typedef uint16_t Pos;
 /* Type definitions for hash callbacks */
 typedef struct internal_state deflate_state;

+/* Hot fields of the deflate strategy: position cursor and remaining-input
+ * counter.  Embedded as a union inside deflate_state (see below) so a struct
+ * copy can move both with a single 64-bit transfer. */
+typedef struct deflate_hot_s {
+    uint32_t lookahead;
+    uint32_t strstart;
+} deflate_hot;
+
 typedef void     (* insert_string_cb)      (deflate_state *const s, uint32_t str, uint32_t count);
 void     insert_string           (deflate_state *const s, uint32_t str, uint32_t count);
 void     insert_string_roll      (deflate_state *const s, uint32_t str, uint32_t count);
@@ -156,8 +164,13 @@ struct ALIGNED_(64) internal_state {

                 /* Cacheline 1 */

-    unsigned int  lookahead;    /* number of valid bytes ahead in window */
-    unsigned int strstart;      /* start of string to insert */
+    union {
+        deflate_hot hot;
+        struct {
+            unsigned int  lookahead;    /* number of valid bytes ahead in window */
+            unsigned int strstart;      /* start of string to insert */
+        };
+    };
     unsigned int  w_size;       /* LZ77 window size (32K by default) */

     int block_start;            /* Window position at the beginning of the current output block.
diff --git a/deflate_quick.c b/deflate_quick.c
--- a/deflate_quick.c
+++ b/deflate_quick.c
@@ -48,8 +48,7 @@
 Z_INTERNAL block_state deflate_quick(deflate_state *s, int flush) {
     unsigned char *window;
     unsigned last = (flush == Z_FINISH) ? 1 : 0;
-    unsigned int lookahead = s->lookahead;
-    unsigned int strstart = s->strstart;
+    deflate_hot e = s->hot;

     ... (rest of function uses e.lookahead / e.strstart instead of scalars)

Takeaway

The PR #2291 scalar-local hoist is the right form. Wrapping the two fields in a struct or union adds source-level ceremony without changing the generated code on either AArch64 or x86_64, because the compiler's SROA + load coalescing already produces the optimal instructions. The union trick from the deflate-struct-hoist PR only helped because that struct was larger (32 bytes, 4 fields) and the STORE wrote enough adjacent bytes to escape SROA decomposition, at the cost of writing back invariant fields.