The level-1 deflate strategy deflate_quick runs a tight per-byte loop that
reads and writes s->strstart and s->lookahead on every iteration:
- the lookahead-bound check at the top of the loop (s->lookahead < MIN_LOOKAHEAD, s->lookahead >= WANT_MIN_MATCH)
- the literal/match fast path reads window + s->strstart, passes s->strstart to quick_insert_value, and accesses s->lookahead for the match cap
- on a literal: s->strstart++; s->lookahead--
- on a match: s->lookahead -= match_len; s->strstart += match_len

These are simple unsigned ints inside the deflate state struct, but on
AArch64 with Apple clang they get reloaded from / stored to memory on
every iteration of the inner loop. Aliasing analysis can't prove that
the writes the loop performs through other pointers (notably the
s->pending_buf[] writes done by the inlined zng_tr_emit_* helpers)
don't touch the addresses of these fields, so the compiler keeps them
in memory instead of promoting them to registers.
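
The aliasing hazard can be reproduced in isolation. The sketch below is a toy, not zlib-ng code: toy_counter and bump_and_clear are hypothetical names. It relies only on the C rule that an unsigned char pointer may legally alias any object, which is why a compiler cannot assume byte stores leave a struct field untouched.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical one-field state, standing in for the deflate state struct. */
typedef struct {
    uint32_t count;
} toy_counter;

/* Each iteration stores four zero bytes through `out`, then bumps s->count.
 * Because `out` has character type it may point at s->count itself, so the
 * compiler has to reload s->count from memory after every store. */
static void bump_and_clear(toy_counter *s, unsigned char *out, int n) {
    assert(s != NULL && out != NULL);
    for (int i = 0; i < n; i++) {
        out[0] = out[1] = out[2] = out[3] = 0;
        s->count++;
    }
}
```

With a separate scratch buffer, five iterations leave count at 5. With out pointing at the field itself, every iteration zeroes the field before the increment, so count ends at 1. A compiler that cached the field in a register would get the second case wrong, so it has to stay conservative in both.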
Lift strstart and lookahead to local variables at function entry. Use the
locals throughout the inner loop. Sync back to s before the two callouts
that observe / mutate them:
- fill_window (mutates s->strstart and s->lookahead when sliding the window): sync, call, reload.
- quick_start_block / quick_end_block (read s->strstart to set s->block_start): sync before, no reload needed (those helpers don't modify the locals).

flush_pending doesn't touch either field, so the only sync needed around
that call is for the early-return path (where we have to commit local state
before exiting).
The locals never have their address taken, which means the compiler has a hard guarantee no other pointer in scope can touch them. They can live in registers across the entire inner loop.
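
The lift-and-sync pattern can be sketched on a reduced state struct. Everything here is a hypothetical stand-in (toy_state, toy_refill, toy_copy_loop) rather than the patch itself: toy_refill plays the role of a callout such as fill_window that mutates the fields, and the loop only copies literals.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reduced state; not zlib-ng's real deflate_state. */
typedef struct {
    uint32_t strstart;
    uint32_t lookahead;
    uint8_t *pending_buf;
    uint32_t pending;
} toy_state;

/* Hypothetical callout that mutates the fields behind the loop's back. */
static void toy_refill(toy_state *s, uint32_t avail) {
    s->lookahead += avail;
}

static void toy_copy_loop(toy_state *s, const uint8_t *window, uint32_t refill_once) {
    assert(s != NULL && window != NULL);
    /* Lift the hot fields into locals whose address is never taken. */
    uint32_t strstart = s->strstart;
    uint32_t lookahead = s->lookahead;

    for (;;) {
        if (lookahead == 0) {
            /* Sync, call, reload around the callout that touches the fields. */
            s->strstart = strstart;
            s->lookahead = lookahead;
            toy_refill(s, refill_once);
            refill_once = 0;
            strstart = s->strstart;
            lookahead = s->lookahead;
            if (lookahead == 0)
                break;
        }
        /* Byte store through pending_buf; it cannot alias the locals. */
        s->pending_buf[s->pending++] = window[strstart];
        strstart++;
        lookahead--;
    }
    /* Commit local state before returning. */
    s->strstart = strstart;
    s->lookahead = lookahead;
}
```

In this sketch s->pending is deliberately left as a field, so only the two lifted values are candidates for staying in registers across the byte stores.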
Originally the inner loop had two separate code paths for setting the
literal byte lc:

```c
if (LIKELY(s->lookahead >= WANT_MIN_MATCH)) {
    uint32_t str_val = Z_U32_FROM_LE(zng_memread_4(window + s->strstart));
    ...
    lc = (uint8_t)str_val;
    if (match path) { emit_dist; continue; }
} else {
    lc = window[s->strstart]; // dedicated 1-byte load
}
zng_tr_emit_lit(s, static_ltree, lc);
```

The two assignments compute the same byte. zlib-ng's window has high_water
padding past lookahead, so a 4-byte read at window + strstart is always
safe regardless of how small lookahead is. That lets the load be hoisted
above the if and the lc variable retired entirely:

```c
uint32_t scan_val = Z_U32_FROM_LE(zng_memread_4(window + strstart));
if (LIKELY(lookahead >= WANT_MIN_MATCH)) {
    ...
    if (match path) { emit_dist; continue; }
}
zng_tr_emit_lit(s, static_ltree, (uint8_t)scan_val);
```

The variable rename str_val → scan_val (and str_start → scan_start
elsewhere in the body) parallels the existing match_val / match_start
pair on the candidate side of the match comparison.
The inner loop now does a single 4-byte load of scan_val and reuses it
for: hash multiplier input, scan_val == match_val equality check, and the
final (uint8_t)scan_val for the literal-emit byte. The dedicated ldrb
that the prior lc = window[s->strstart] path emitted is gone — the rare
lookahead < 4 path now falls through into the same and w_, w_, #0xff
the common path uses.
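
The equivalence between the hoisted 4-byte load and the old 1-byte load can be checked on a toy buffer. This is a sketch under stated assumptions: read_le32, next_literal, and TOY_MIN_MATCH are hypothetical names, the buffer's three trailing bytes stand in for the window padding, and the little-endian read is assembled from bytes so the check does not depend on host endianness.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MIN_MATCH 4 /* assumed threshold, mirroring the lookahead < 4 case */

/* Hypothetical helper: 4-byte little-endian read built from single bytes. */
static uint32_t read_le32(const uint8_t *p) {
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* One load above the branch serves both paths; the literal is its low byte.
 * The caller must guarantee 3 readable bytes past the last valid position. */
static uint8_t next_literal(const uint8_t *window, uint32_t strstart,
                            uint32_t lookahead, int *took_wide_path) {
    assert(window != NULL && took_wide_path != NULL);
    uint32_t scan_val = read_le32(window + strstart); /* hoisted load */
    *took_wide_path = 0;
    if (lookahead >= TOY_MIN_MATCH) {
        /* Real code would hash scan_val and compare against a candidate. */
        *took_wide_path = 1;
    }
    return (uint8_t)scan_val;
}
```

For every position, including those with fewer than four valid bytes left, the returned byte equals window[strstart], which is the property that makes the separate 1-byte load redundant.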
Common-path codegen is unchanged. Function size shrinks by one branch target. The C source is one variable (and one branch arm) lighter.
Apple M5, macOS Darwin 25.4.0 arm64, Apple clang 17, Release build, Google Benchmark, 10 reps, aggregates only.
| benchmark | Δ time |
|---|---|
| level 1 / 131072 | −13.44% |
| level 1 / 1048576 | −12.69% |
Compression ratios bit-identical (Δ ratio = 0.00, Δ bytes = 0) at both sizes.
Levels 3/6/9 are unaffected because they do not use deflate_quick.
deflate_quick.patch: the change.