Follow-up to PR #2291 (nv/develop/deflate-strategy-locals-hoist). After the scalar-local hoist landed, tried two variants in deflate_quick.c to see whether packing the two uint32_t fields into a struct could coax the compiler into 64-bit load/store transfers.
- Apple M5, macOS 26.4.1, Apple clang 21.0.0
- AArch64 native, x86_64 cross-compile via CMAKE_OSX_ARCHITECTURES=x86_64
- CMake Release, -D BUILD_SHARED_LIBS=OFF

Baseline (PR #2291 head): two scalar locals
unsigned int lookahead = s->lookahead;
unsigned int strstart = s->strstart;

Variant 1: local struct with brace initializer

struct { uint32_t strstart, lookahead; } e = { s->strstart, s->lookahead };

Variant 2: anonymous union in deflate_state, struct copy from s->hot

/* in deflate.h */
typedef struct deflate_hot_s {
    uint32_t lookahead;
    uint32_t strstart;
} deflate_hot;

struct internal_state {
    ...
    union {
        deflate_hot hot;
        struct {
            unsigned int lookahead;
            unsigned int strstart;
        };
    };
    ...
};

/* in deflate_quick.c */
deflate_hot e = s->hot;
/* ... use e.lookahead, e.strstart ... */
s->hot = e;

Function body of deflate_quick, both arches.

| Variant | AArch64 lines | x86_64 lines | x86_64 spills |
|---|---|---|---|
| Scalar locals (baseline) | 417 | 457 | 18 |
| Variant 1: brace-init local struct | 417 | 457 | 18 |
| Variant 2: union + s->hot copy | 417 | 457 | 18 |

AArch64: byte-for-byte identical across all three variants. The optimizer already emits a single ldp for the adjacent 32-bit fields whether the source is two assignments or a struct copy.
ldp w23, w22, [x0, #0x40] ; load lookahead + strstart in one instruction
...
stp w23, w22, [x19, #0x40] ; store back in one instruction
x86_64: identical apart from a minor scheduling reorder of three movl in Variant 1 (already noted in the PR-thread post). All variants emit two movl per LOAD and two per STORE, because x86 has no ldp-equivalent for paired 32-bit GPR loads.
movl 0x40(%rdi), %r12d ; lookahead
movl 0x44(%rdi), %r13d ; strstart
...
movl %r12d, 0x40(%rbx)
movl %r13d, 0x44(%rbx)
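
For anyone who wants to reproduce the comparison outside the zlib-ng tree, the three source forms can be reduced to a toy translation unit like the one below. This is a minimal sketch, not the code from the PR: toy_state, toy_hot, the advance_* functions and variants_agree are hypothetical names, and the loop body is a placeholder for the real work in deflate_quick. Compiling it with something like cc -O2 -S for each target and diffing the three advance_* bodies should show whether your compiler treats the forms the same way.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>   /* offsetof */
#include <stdint.h>

/* Hypothetical stand-in for the two hot fields of deflate_state. */
typedef struct toy_hot_s {
    uint32_t lookahead;
    uint32_t strstart;
} toy_hot;

typedef struct toy_state_s {
    uint64_t other;               /* placeholder for unrelated fields */
    union {
        toy_hot hot;
        struct {
            unsigned int lookahead;
            unsigned int strstart;
        };
    };
} toy_state;

/* Variant 2 is only valid if both views of the union overlay exactly. */
static_assert(sizeof(unsigned int) == sizeof(uint32_t),
              "unsigned int must be 32 bits wide");
static_assert(offsetof(toy_state, hot) == offsetof(toy_state, lookahead),
              "lookahead views must alias");
static_assert(offsetof(toy_state, hot) + offsetof(toy_hot, strstart)
                  == offsetof(toy_state, strstart),
              "strstart views must alias");

/* Baseline: two scalar locals. */
static void advance_scalars(toy_state *s, unsigned int n) {
    unsigned int lookahead = s->lookahead;
    unsigned int strstart = s->strstart;
    while (n > 0 && lookahead > 0) { strstart++; lookahead--; n--; }
    s->lookahead = lookahead;
    s->strstart = strstart;
}

/* Variant 1: local struct with brace initializer. */
static void advance_local_struct(toy_state *s, unsigned int n) {
    struct { uint32_t strstart, lookahead; } e = { s->strstart, s->lookahead };
    while (n > 0 && e.lookahead > 0) { e.strstart++; e.lookahead--; n--; }
    s->lookahead = e.lookahead;
    s->strstart = e.strstart;
}

/* Variant 2: struct copy through the union member. */
static void advance_hot_copy(toy_state *s, unsigned int n) {
    toy_hot e = s->hot;
    while (n > 0 && e.lookahead > 0) { e.strstart++; e.lookahead--; n--; }
    s->hot = e;
}

/* Returns 1 when all three forms leave the state in the same condition. */
int variants_agree(unsigned int start, unsigned int avail, unsigned int n) {
    toy_state a = {0}, b = {0}, c = {0};
    a.strstart = b.strstart = c.strstart = start;
    a.lookahead = b.lookahead = c.lookahead = avail;
    advance_scalars(&a, n);
    advance_local_struct(&b, n);
    advance_hot_copy(&c, n);
    return a.strstart == b.strstart && b.strstart == c.strstart
        && a.lookahead == b.lookahead && b.lookahead == c.lookahead;
}
```

The static_assert lines are the part worth carrying over if the union form is ever used for real: they fail the build if padding or a type change stops the two views from overlaying.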
LLVM's SROA (Scalar Replacement of Aggregates) pass decomposes small structs into their constituent scalars very early. By the time codegen runs, there's no struct left to copy as a unit — just independent scalar fields. The compiler then emits whatever the target ISA prefers for two independent same-size scalars:
- AArch64 is smart enough to coalesce two adjacent same-size scalar loads/stores into ldp/stp after SROA. So the source representation doesn't matter; the codegen is already optimal.
- x86_64 has no instruction that loads two 32-bit values into two GPRs in one shot. movq would load both into one 64-bit register, requiring a subsequent extract, which is strictly more work. Two movl is already the right answer.

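
To make the "subsequent extract" concrete, here is a rough C rendering of the two x86_64 options. It is a sketch under assumptions: pair32, load_two_32, load_one_64 and wide_load_matches are hypothetical names, the target is assumed little-endian (true for both arches tested here), and an optimizer is free to rewrite either function into the other, so this shows the shape of the work rather than what any particular compiler emits.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical pair of adjacent 32-bit fields, laid out like the
 * lookahead/strstart pair at offsets 0x40/0x44. */
typedef struct pair32_s {
    uint32_t lookahead;
    uint32_t strstart;
} pair32;

/* Two independent 32-bit loads: the "two movl" shape. */
static void load_two_32(const pair32 *p, uint32_t *lookahead, uint32_t *strstart) {
    *lookahead = p->lookahead;
    *strstart = p->strstart;
}

/* One 64-bit load followed by a split: the "movq + extract" shape.
 * Assumes little-endian, so the first field sits in the low half. */
static void load_one_64(const pair32 *p, uint32_t *lookahead, uint32_t *strstart) {
    uint64_t both;
    memcpy(&both, p, sizeof both);        /* single wide load */
    *lookahead = (uint32_t)both;          /* low 32 bits */
    *strstart = (uint32_t)(both >> 32);   /* extra shift to reach the high half */
}

/* Returns 1 when both load shapes recover the same field values. */
int wide_load_matches(uint32_t lookahead, uint32_t strstart) {
    pair32 p = { lookahead, strstart };
    uint32_t a_look, a_str, b_look, b_str;
    load_two_32(&p, &a_look, &a_str);
    load_one_64(&p, &b_look, &b_str);
    return a_look == b_look && a_str == b_str;
}
```

The wide form saves one load but adds a shift and a register copy before both values are usable in separate GPRs, which is the extra work the bullet above refers to.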
diff --git a/deflate.h b/deflate.h
index 3f9f8f46..bba5773a 100644
--- a/deflate.h
+++ b/deflate.h
@@ -120,6 +120,14 @@ typedef uint16_t Pos;
/* Type definitions for hash callbacks */
typedef struct internal_state deflate_state;
+/* Hot fields of the deflate strategy: position cursor and remaining-input
+ * counter. Embedded as a union inside deflate_state (see below) so a struct
+ * copy can move both with a single 64-bit transfer. */
+typedef struct deflate_hot_s {
+ uint32_t lookahead;
+ uint32_t strstart;
+} deflate_hot;
+
typedef void (* insert_string_cb) (deflate_state *const s, uint32_t str, uint32_t count);
void insert_string (deflate_state *const s, uint32_t str, uint32_t count);
void insert_string_roll (deflate_state *const s, uint32_t str, uint32_t count);
@@ -156,8 +164,13 @@ struct ALIGNED_(64) internal_state {
/* Cacheline 1 */
- unsigned int lookahead; /* number of valid bytes ahead in window */
- unsigned int strstart; /* start of string to insert */
+ union {
+ deflate_hot hot;
+ struct {
+ unsigned int lookahead; /* number of valid bytes ahead in window */
+ unsigned int strstart; /* start of string to insert */
+ };
+ };
unsigned int w_size; /* LZ77 window size (32K by default) */
int block_start; /* Window position at the beginning of the current output block.
diff --git a/deflate_quick.c b/deflate_quick.c
--- a/deflate_quick.c
+++ b/deflate_quick.c
@@ -48,8 +48,7 @@
Z_INTERNAL block_state deflate_quick(deflate_state *s, int flush) {
unsigned char *window;
unsigned last = (flush == Z_FINISH) ? 1 : 0;
- unsigned int lookahead = s->lookahead;
- unsigned int strstart = s->strstart;
+ deflate_hot e = s->hot;
... (rest of function uses e.lookahead / e.strstart instead of scalars)

The PR #2291 scalar-local hoist is the right form. Wrapping the two fields in a struct or union adds source-level ceremony without changing the generated code on either AArch64 or x86_64. The compiler's SROA + load coalescing already produces the optimal instructions. The union trick from the deflate-struct-hoist PR only helped because that struct was larger (32 bytes, 4 fields) and the STORE wrote enough adjacent bytes to escape SROA decomposition, at the cost of writing back invariant fields.