Inspecting AArch64 disassembly of zlib-ng's level-9 deflate path showed that the rolling-hash insert loop emitted a redundant store of the running hash to memory on every byte processed. The cause was visible in the C source and the fix is small and self-contained.
insert_string_p.h configured HASH_CALC_VAR as s->ins_h for the rolling
variant. The template in insert_string_tpl.h then operated on that lvalue
directly inside the per-byte loop:
HASH_CALC(HASH_CALC_VAR, val); // s->ins_h = (s->ins_h << 5) ^ val
HASH_CALC_VAR &= HASH_CALC_MASK;
hm = HASH_CALC_VAR;
head = headp[hm];
if (LIKELY(head != idx)) {
prevp[idx & w_mask] = (Pos)head; // store through Pos*
headp[hm] = (Pos)idx; // store through Pos*
}prevp and headp are Pos* aliases of fields inside the same deflate_state
struct as s->ins_h. The C type system cannot prove these stores never alias
the integer field, so the compiler had to spill s->ins_h to memory and
reload it on every iteration.
The non-rolling variant already used a local (HASH_CALC_VAR = h), so it did
not have the spill.
; hot loop body, 13 instructions, includes redundant store
ldrb w14, [x12]
ubfiz w13, w13, #5, #10
eor w13, w13, w14
str w13, [x0, #0x70] ; <-- s->ins_h spill every iteration
ldrh w14, [x10, w13, uxtw #1]
cmp w1, w14
b.eq next
and w15, w1, w11
strh w14, [x8, w15, uxtw #1]
strh w1, [x10, w13, uxtw #1]
add w1, w1, #0x1
add x12, x12, #0x1
cmp x12, x9
b.lo loop_top
; hot loop body, 12 instructions, no per-byte spill
ldrb w14, [x13]
ubfiz w8, w8, #5, #10
eor w8, w8, w14
ldrh w14, [x11, w8, uxtw #1]
cmp w1, w14
b.eq next
and w15, w1, w12
strh w14, [x9, w15, uxtw #1]
strh w1, [x11, w8, uxtw #1]
add w1, w1, #0x1
add x13, x13, #0x1
cmp x13, x10
b.lo loop_top
The new code loads s->ins_h once at function entry, holds the running hash
in w8 for the duration of the loop, and stores s->ins_h once at exit.
Three macros let the template control hash-state lifetime:
HASH_CALC_VAR— the lvalue used for the running hash (nowh, a local).HASH_CALC_VAR_INIT— declares and seeds the local froms->ins_h.HASH_CALC_VAR_STORE— writes the local back tos->ins_h.
The template hoists HASH_CALC_VAR_INIT above the loop and emits
HASH_CALC_VAR_STORE after it. Both macros are no-ops for the non-rolling
variant where the running hash is recomputed from val each iteration, so
that path is byte-identical to before.
Apple M5, macOS Darwin 25.4.0 arm64, Apple clang 17, Release build, Google Benchmark, 7 reps, aggregates only.
| count | Δ time |
|---|---|
| 3 | −4.4% |
| 4 | −5.3% |
| 5 | −7.4% |
| 7 | −8.6% |
| 14 | −4.0% |
| 32 | −0.6% |
| 127 | −2.3% |
| 255 | −5.4% |
| size | Δ time after thermal symmetry |
|---|---|
| 131072 | −1.9% |
| 1048576 | within noise |
A second run with reversed order revealed the original macro-level numbers
were thermally biased on this machine; the symmetric estimate
R = (run1 − run2) / 2 returned the result above for level 9 / 131072 and
left level 9 / 1048576 below the noise floor (CV 1–6%). Compression ratios
unchanged across all sizes.
The non-rolling deflate paths (levels 1–8) are unaffected — the generated
assembly for _insert_string is byte-identical before and after.
insert_string_roll.patch— the change.