zlib-ng PR #2291 — struct-local experiment for strstart/lookahead

Follow-up to PR #2291 (nv/develop/deflate-strategy-locals-hoist). After the scalar-local hoist landed, I tried two variants in deflate_quick.c to see whether packing the two 32-bit fields (strstart and lookahead) into a struct could coax the compiler into single 64-bit load/store transfers.

Machine

Apple M5, macOS 26.4.1, Apple clang 21.0.0
AArch64 native, x86_64 cross-compile via CMAKE_OSX_ARCHITECTURES=x86_64
CMake Release, -D BUILD_SHARED_LIBS=OFF

Variants tried

Baseline (PR #2291 head): two scalar locals

unsigned int lookahead = s->lookahead;
unsigned int strstart = s->strstart;

Variant 1: local struct with brace initializer

struct { uint32_t strstart, lookahead; } e = { s->strstart, s->lookahead };

Variant 2: anonymous union in deflate_state, struct copy from s->hot

/* in deflate.h */
typedef struct deflate_hot_s {
    uint32_t lookahead;
    uint32_t strstart;
} deflate_hot;

struct internal_state {
    ...
    union {
        deflate_hot hot;
        struct {
            unsigned int  lookahead;
            unsigned int strstart;
        };
    };
    ...
};

/* in deflate_quick.c */
deflate_hot e = s->hot;
/* ... use e.lookahead, e.strstart ... */
s->hot = e;
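
The union in Variant 2 only works if both views cover exactly the same 8 bytes. A minimal sketch of how that could be checked at compile time follows. `struct mini_state` and `hot_roundtrip_ok` are hypothetical stand-ins written for this note, not zlib-ng code, and the sketch assumes a C11 compiler with a 32-bit `unsigned int`.

```c
#include <assert.h>  /* static_assert (C11) */
#include <stddef.h>  /* offsetof */
#include <stdint.h>

typedef struct deflate_hot_s {
    uint32_t lookahead;
    uint32_t strstart;
} deflate_hot;

/* Hypothetical cut-down stand-in for struct internal_state. */
struct mini_state {
    union {
        deflate_hot hot;
        struct {
            unsigned int lookahead;
            unsigned int strstart;
        };
    };
    unsigned int w_size;
};

/* Both views must overlap field for field, with no padding. */
static_assert(sizeof(unsigned int) == sizeof(uint32_t),
              "unsigned int must be 32 bits");
static_assert(sizeof(deflate_hot) == 8, "deflate_hot must have no padding");
static_assert(offsetof(struct mini_state, lookahead) ==
              offsetof(struct mini_state, hot) + offsetof(deflate_hot, lookahead),
              "lookahead views must overlap");
static_assert(offsetof(struct mini_state, strstart) ==
              offsetof(struct mini_state, hot) + offsetof(deflate_hot, strstart),
              "strstart views must overlap");

/* Mimic the Variant 2 pattern: copy out, update, copy back, then read the
 * result through the scalar view. Returns 1 if the scalar view sees it. */
int hot_roundtrip_ok(void) {
    struct mini_state s;
    s.lookahead = 100;
    s.strstart = 7;
    s.w_size = 32768;

    deflate_hot e = s.hot;
    e.strstart += 3;
    e.lookahead -= 3;
    s.hot = e;

    return s.lookahead == 97 && s.strstart == 10 && s.w_size == 32768;
}
```

If any of the assertions fails on some target, the struct copy and the scalar fields would no longer describe the same memory, and Variant 2 would be incorrect there rather than merely unhelpful.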

Asm comparison

Line counts are for the assembly of the deflate_quick function body on each architecture.

Variant	AArch64 lines	x86_64 lines	x86_64 spills
Scalar locals (baseline)	417	457	18
Variant 1: brace-init local struct	417	457	18
Variant 2: union + s->hot copy	417	457	18

AArch64: byte-for-byte identical across all three variants. The compiler already emits a single ldp to load the two adjacent 32-bit fields and a single stp to store them back, whether the source uses two scalar assignments or a struct copy.

ldp w23, w22, [x0, #0x40]   ; load lookahead + strstart in one instruction
...
stp w23, w22, [x19, #0x40]  ; store back in one instruction

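x86_64: same line count and spill count in all three variants. The only difference is a minor scheduling reorder of three movl instructions in Variant 1, already noted in the PR-thread post. Every variant emits two movl for the load and two movl for the store-back, because x86_64 has no ldp equivalent that fills two 32-bit GPRs from one load.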

movl  0x40(%rdi), %r12d   ; lookahead
movl  0x44(%rdi), %r13d   ; strstart
...
movl  %r12d, 0x40(%rbx)
movl  %r13d, 0x44(%rbx)
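
For anyone who wants to look at this pattern outside zlib-ng, a reduced test case could look like the sketch below. Everything in it (`struct mini_state`, `consume_scalar`, `consume_struct`, `check_consume`) is hypothetical and written for this note. Whether a function this small shows the same ldp/stp and movl patterns as deflate_quick is an assumption, not something measured here.

```c
/* Inspect the output with, for example:
 *   clang -O2 -S repro.c -o -
 * On macOS, adding -arch x86_64 cross-compiles the same file. */
#include <assert.h>
#include <stdint.h>

typedef struct deflate_hot_s {
    uint32_t lookahead;
    uint32_t strstart;
} deflate_hot;

/* Hypothetical cut-down stand-in for struct internal_state. */
struct mini_state {
    union {
        deflate_hot hot;
        struct {
            unsigned int lookahead;
            unsigned int strstart;
        };
    };
};

/* Baseline shape: two scalar locals, two scalar stores. */
void consume_scalar(struct mini_state *s, unsigned int n) {
    unsigned int lookahead = s->lookahead;
    unsigned int strstart = s->strstart;
    if (n > lookahead)
        n = lookahead;
    strstart += n;
    lookahead -= n;
    s->lookahead = lookahead;
    s->strstart = strstart;
}

/* Variant 2 shape: struct copy in, struct copy out. */
void consume_struct(struct mini_state *s, unsigned int n) {
    deflate_hot e = s->hot;
    if (n > e.lookahead)
        n = e.lookahead;
    e.strstart += n;
    e.lookahead -= n;
    s->hot = e;
}

/* Sanity check: both shapes must compute the same state. */
void check_consume(void) {
    struct mini_state a, b;
    a.lookahead = 10;
    a.strstart = 5;
    b = a;
    consume_scalar(&a, 4);
    consume_struct(&b, 4);
    assert(a.lookahead == 6 && a.strstart == 9);
    assert(b.lookahead == a.lookahead && b.strstart == a.strstart);
}
```

Comparing the assembly of the two functions side by side is the same kind of check as the table above, just without the rest of deflate_quick around it.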

Why none of the variants change codegen

LLVM's SROA (Scalar Replacement of Aggregates) pass decomposes small structs into their constituent scalars early in the optimization pipeline. By the time codegen runs, there is no struct left to copy as a unit, only independent scalar fields. The compiler then emits whatever the target ISA prefers for two independent same-size scalars:

AArch64: the backend merges two adjacent same-size scalar loads or stores into ldp/stp after SROA, so the source representation doesn't matter and the codegen is already optimal.
x86_64: no instruction loads two 32-bit values into two GPRs in one shot. A movq would load both into one 64-bit register and then need a shift and a move to split them, which is more work than two movl. Two movl is already the right answer.
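
The extra work behind a single 64-bit load can be spelled out in C. This is an illustrative sketch only: `load_two_32`, `load_one_64`, and `check_loads` are hypothetical names, and the split assumes a little-endian target, which holds for both architectures tested here.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Two independent 32-bit loads, the shape every variant compiles to. */
void load_two_32(const void *p, uint32_t *lookahead, uint32_t *strstart) {
    memcpy(lookahead, p, sizeof *lookahead);
    memcpy(strstart, (const unsigned char *)p + 4, sizeof *strstart);
}

/* One 64-bit load, then the split that a movq would force. */
void load_one_64(const void *p, uint32_t *lookahead, uint32_t *strstart) {
    uint64_t both;
    memcpy(&both, p, sizeof both);
    *lookahead = (uint32_t)both;          /* low half: lower address (little-endian) */
    *strstart  = (uint32_t)(both >> 32);  /* high half needs a shift first */
}

/* Sanity check: on a little-endian host both forms read the same values. */
void check_loads(void) {
    uint32_t mem[2] = { 100u, 7u };  /* lookahead, strstart */
    uint32_t a, b, c, d;
    load_two_32(mem, &a, &b);
    load_one_64(mem, &c, &d);
    assert(a == 100u && b == 7u);
    assert(c == a && d == b);
}
```

The second form trades one load for a shift plus a register move, so it saves nothing when both values have to end up in separate registers anyway.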

Diff that was tested

diff --git a/deflate.h b/deflate.h
index 3f9f8f46..bba5773a 100644
--- a/deflate.h
+++ b/deflate.h
@@ -120,6 +120,14 @@ typedef uint16_t Pos;
 /* Type definitions for hash callbacks */
 typedef struct internal_state deflate_state;

+/* Hot fields of the deflate strategy: position cursor and remaining-input
+ * counter.  Embedded as a union inside deflate_state (see below) so a struct
+ * copy can move both with a single 64-bit transfer. */
+typedef struct deflate_hot_s {
+    uint32_t lookahead;
+    uint32_t strstart;
+} deflate_hot;
+
 typedef void     (* insert_string_cb)      (deflate_state *const s, uint32_t str, uint32_t count);
 void     insert_string           (deflate_state *const s, uint32_t str, uint32_t count);
 void     insert_string_roll      (deflate_state *const s, uint32_t str, uint32_t count);
@@ -156,8 +164,13 @@ struct ALIGNED_(64) internal_state {

                 /* Cacheline 1 */

-    unsigned int  lookahead;    /* number of valid bytes ahead in window */
-    unsigned int strstart;      /* start of string to insert */
+    union {
+        deflate_hot hot;
+        struct {
+            unsigned int  lookahead;    /* number of valid bytes ahead in window */
+            unsigned int strstart;      /* start of string to insert */
+        };
+    };
     unsigned int  w_size;       /* LZ77 window size (32K by default) */

     int block_start;            /* Window position at the beginning of the current output block.
diff --git a/deflate_quick.c b/deflate_quick.c
--- a/deflate_quick.c
+++ b/deflate_quick.c
@@ -48,8 +48,7 @@
 Z_INTERNAL block_state deflate_quick(deflate_state *s, int flush) {
     unsigned char *window;
     unsigned last = (flush == Z_FINISH) ? 1 : 0;
-    unsigned int lookahead = s->lookahead;
-    unsigned int strstart = s->strstart;
+    deflate_hot e = s->hot;

     ... (rest of function uses e.lookahead / e.strstart instead of scalars)

Takeaway

The PR #2291 scalar-local hoist is the right form. Wrapping the two fields in a struct or union adds source-level ceremony without changing the generated code on either AArch64 or x86_64: SROA plus load/store pairing already produces the optimal instructions. The union trick from the deflate-struct-hoist PR only helped because that struct was larger (32 bytes, 4 fields) and the STORE wrote enough adjacent bytes to escape SROA decomposition, at the cost of writing back invariant fields.

nmoinvaz/zlib-ng-pr2291-strstart-lookahead-struct-experiment.md
