Skip to content

Instantly share code, notes, and snippets.

@nmoinvaz
Created May 8, 2026 06:21
Show Gist options
  • Select an option

  • Save nmoinvaz/e1fdd4d7af0a144d7037d325f2d6c523 to your computer and use it in GitHub Desktop.

Select an option

Save nmoinvaz/e1fdd4d7af0a144d7037d325f2d6c523 to your computer and use it in GitHub Desktop.
zlib-ng: eliminate s->ins_h spill in insert_string_roll inner loop

zlib-ng: eliminate s->ins_h spill in insert_string_roll inner loop

Background

Inspecting AArch64 disassembly of zlib-ng's level-9 deflate path showed that the rolling-hash insert loop emitted a redundant store of the running hash to memory on every byte processed. The cause was visible in the C source and the fix is small and self-contained.

Root cause

insert_string_p.h configured HASH_CALC_VAR as s->ins_h for the rolling variant. The template in insert_string_tpl.h then operated on that lvalue directly inside the per-byte loop:

HASH_CALC(HASH_CALC_VAR, val);          // s->ins_h = (s->ins_h << 5) ^ val
HASH_CALC_VAR &= HASH_CALC_MASK;
hm = HASH_CALC_VAR;
head = headp[hm];
if (LIKELY(head != idx)) {
    prevp[idx & w_mask] = (Pos)head;    // store through Pos*
    headp[hm] = (Pos)idx;               // store through Pos*
}

prevp and headp are Pos* aliases of fields inside the same deflate_state struct as s->ins_h. The C type system cannot prove these stores never alias the integer field, so the compiler had to spill s->ins_h to memory and reload it on every iteration.

The non-rolling variant already used a local (HASH_CALC_VAR = h), so it did not have the spill.

Generated code (AArch64, Apple clang 17)

Before

; hot loop body, 13 instructions, includes redundant store
ldrb  w14, [x12]
ubfiz w13, w13, #5, #10
eor   w13, w13, w14
str   w13, [x0, #0x70]            ; <-- s->ins_h spill every iteration
ldrh  w14, [x10, w13, uxtw #1]
cmp   w1, w14
b.eq  next
and   w15, w1, w11
strh  w14, [x8, w15, uxtw #1]
strh  w1, [x10, w13, uxtw #1]
add   w1, w1, #0x1
add   x12, x12, #0x1
cmp   x12, x9
b.lo  loop_top

After

; hot loop body, 12 instructions, no per-byte spill
ldrb  w14, [x13]
ubfiz w8,  w8,  #5, #10
eor   w8,  w8,  w14
ldrh  w14, [x11, w8, uxtw #1]
cmp   w1, w14
b.eq  next
and   w15, w1, w12
strh  w14, [x9, w15, uxtw #1]
strh  w1, [x11, w8, uxtw #1]
add   w1, w1, #0x1
add   x13, x13, #0x1
cmp   x13, x10
b.lo  loop_top

The new code loads s->ins_h once at function entry, holds the running hash in w8 for the duration of the loop, and stores s->ins_h once at exit.

Fix

Three macros let the template control hash-state lifetime:

  • HASH_CALC_VAR — the lvalue used for the running hash (now h, a local).
  • HASH_CALC_VAR_INIT — declares and seeds the local from s->ins_h.
  • HASH_CALC_VAR_STORE — writes the local back to s->ins_h.

The template hoists HASH_CALC_VAR_INIT above the loop and emits HASH_CALC_VAR_STORE after it. Both macros are no-ops for the non-rolling variant where the running hash is recomputed from val each iteration, so that path is byte-identical to before.

Performance

Apple M5, macOS Darwin 25.4.0 arm64, Apple clang 17, Release build, Google Benchmark, 7 reps, aggregates only.

Targeted micro-benchmark (insert_string_bench/rolling_hash)

count Δ time
3 −4.4%
4 −5.3%
5 −7.4%
7 −8.6%
14 −4.0%
32 −0.6%
127 −2.3%
255 −5.4%

End-to-end deflate level 9

size Δ time after thermal symmetry
131072 −1.9%
1048576 within noise

A second run with reversed order revealed the original macro-level numbers were thermally biased on this machine; the symmetric estimate R = (run1 − run2) / 2 returned the result above for level 9 / 131072 and left level 9 / 1048576 below the noise floor (CV 1–6%). Compression ratios unchanged across all sizes.

The non-rolling deflate paths (levels 1–8) are unaffected — the generated assembly for _insert_string is byte-identical before and after.

Files

  • insert_string_roll.patch — the change.
diff --git a/insert_string_p.h b/insert_string_p.h
index e768b3ea..077e004f 100644
--- a/insert_string_p.h
+++ b/insert_string_p.h
@@ -15,6 +15,7 @@
#define HASH_CALC_MASK HASH_MASK
#define HASH_CALC_VAR h
#define HASH_CALC_VAR_INIT uint32_t h
+#define HASH_CALC_VAR_STORE
#define HASH_CALC_OFFSET 0
#define UPDATE_HASH update_hash
@@ -28,8 +29,9 @@
#define HASH_SLIDE 5
#define HASH_CALC(h, val) h = ((h << HASH_SLIDE) ^ ((uint8_t)val))
-#define HASH_CALC_VAR s->ins_h
-#define HASH_CALC_VAR_INIT
+#define HASH_CALC_VAR h
+#define HASH_CALC_VAR_INIT uint32_t h = s->ins_h
+#define HASH_CALC_VAR_STORE s->ins_h = h
#define HASH_CALC_READ val = strstart[0]
#define HASH_CALC_MASK (32768u - 1u)
#define HASH_CALC_OFFSET (STD_MIN_MATCH-1)
diff --git a/insert_string_tpl.h b/insert_string_tpl.h
index f8318e4e..e8a33d04 100644
--- a/insert_string_tpl.h
+++ b/insert_string_tpl.h
@@ -46,6 +46,7 @@ Z_FORCEINLINE static uint32_t QUICK_INSERT_VALUE(deflate_state *const s, uint32_
HASH_CALC(HASH_CALC_VAR, val);
HASH_CALC_VAR &= HASH_CALC_MASK;
hm = HASH_CALC_VAR;
+ HASH_CALC_VAR_STORE;
head = s->head[hm];
if (LIKELY(head != str)) {
@@ -69,6 +70,7 @@ Z_FORCEINLINE static uint32_t QUICK_INSERT_STRING(deflate_state *const s, uint32
HASH_CALC(HASH_CALC_VAR, val);
HASH_CALC_VAR &= HASH_CALC_MASK;
hm = HASH_CALC_VAR;
+ HASH_CALC_VAR_STORE;
head = s->head[hm];
if (LIKELY(head != str)) {
@@ -95,10 +97,11 @@ Z_FORCEINLINE static void INSERT_STRING(deflate_state *const s, uint32_t str, ui
Pos *prevp = s->prev;
const unsigned int w_mask = W_MASK(s);
+ HASH_CALC_VAR_INIT;
+
for (uint32_t idx = str; strstart < strend; idx++, strstart++) {
uint32_t val, hm, head;
- HASH_CALC_VAR_INIT;
HASH_CALC_READ;
HASH_CALC(HASH_CALC_VAR, val);
HASH_CALC_VAR &= HASH_CALC_MASK;
@@ -110,6 +113,8 @@ Z_FORCEINLINE static void INSERT_STRING(deflate_state *const s, uint32_t str, ui
headp[hm] = (Pos)idx;
}
}
+
+ HASH_CALC_VAR_STORE;
}
// Cleanup
@@ -120,6 +125,7 @@ Z_FORCEINLINE static void INSERT_STRING(deflate_state *const s, uint32_t str, ui
#undef HASH_CALC_OFFSET
#undef HASH_CALC_VAR
#undef HASH_CALC_VAR_INIT
+#undef HASH_CALC_VAR_STORE
#undef UPDATE_HASH
#undef INSERT_STRING
#undef QUICK_INSERT_STRING
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment