@nmoinvaz
Last active May 12, 2026 04:33

zlib-ng: deflate-struct-hoist — codegen and benchmark analysis

Branch: nv/develop/deflate-struct-hoist
Commit: c435d01e ("Hoist deflate bit-emit state into deflate_emit_hot local")
Base: upstream/develop (48087450)

Change

Introduces a deflate_emit_hot struct (bi_buf, bi_valid, pending_buf, pending) and DEFLATE_EMIT_HOT_LOAD/STORE macros that cache bit-emit state in a local at the top of hot loops and write it back once on exit. Converts put_byte/put_short/put_short_msb/put_uint32/put_uint32_msb/put_uint64 from macros to static inline functions taking a deflate_emit_hot *. Cold-path callers in deflate.c (zlib/gzip header and trailer writers, name/comment loops) and deflate_stored.c bracket their put_* clusters with LOAD/STORE.
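The pattern described above can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: `sketch_state` is a hypothetical stand-in for zlib-ng's `deflate_state`, the field types are assumed, the macro bodies and argument order are guesses at the shape of `DEFLATE_EMIT_HOT_LOAD`/`STORE`, and `write_header_sketch` is an invented cold-path caller.

```c
#include <stdint.h>

/* Hypothetical stand-in for deflate_state: only the four bit-emit
 * fields named in the text are modelled; the types are assumptions. */
typedef struct {
    uint64_t bi_buf;       /* bit accumulator */
    uint32_t bi_valid;     /* number of valid bits in bi_buf */
    uint8_t *pending_buf;  /* output buffer */
    uint32_t pending;      /* bytes already written to pending_buf */
} sketch_state;

/* Local copy of the bit-emit state, meant to live in registers. */
typedef struct {
    uint64_t bi_buf;
    uint32_t bi_valid;
    uint8_t *pending_buf;
    uint32_t pending;
} deflate_emit_hot;

/* Assumed macro shapes: copy the state in once, write it back once. */
#define DEFLATE_EMIT_HOT_LOAD(e, s) do { \
    (e).bi_buf = (s)->bi_buf; \
    (e).bi_valid = (s)->bi_valid; \
    (e).pending_buf = (s)->pending_buf; \
    (e).pending = (s)->pending; \
} while (0)

#define DEFLATE_EMIT_HOT_STORE(e, s) do { \
    (s)->bi_buf = (e).bi_buf; \
    (s)->bi_valid = (e).bi_valid; \
    (s)->pending = (e).pending; \
} while (0)

/* put_* helpers as static inline functions on the hot struct. */
static inline void put_byte(deflate_emit_hot *e, uint8_t c) {
    e->pending_buf[e->pending++] = c;
}

static inline void put_short_msb(deflate_emit_hot *e, uint16_t w) {
    put_byte(e, (uint8_t)(w >> 8));
    put_byte(e, (uint8_t)(w & 0xff));
}

/* Invented cold-path caller: brackets its put_* cluster with LOAD/STORE. */
static void write_header_sketch(sketch_state *s) {
    deflate_emit_hot e;
    DEFLATE_EMIT_HOT_LOAD(e, s);
    put_short_msb(&e, 0x789c);  /* a common 2-byte zlib header (CMF, FLG) */
    DEFLATE_EMIT_HOT_STORE(e, s);
}
```

Calling `write_header_sketch` on a zeroed state with a caller-supplied buffer leaves two bytes in the buffer and `pending == 2`. The helpers take a `deflate_emit_hot *` instead of expanding in place, so passing the wrong kind of state becomes a compile error.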

Net diff: 5 files, +233 / -190.

Machine

  • Apple M5 (10 physical / 10 logical cores), 32 GiB RAM
  • macOS 26.4.1 (build 25E253)
  • Apple clang 21.0.0
  • CMake Release, -D BUILD_SHARED_LIBS=OFF

Codegen impact (per-function, AArch64)

Comparing .o output of upstream/develop (base) vs HEAD.

Hot functions in trees.c

| Function | base ld/st | head ld/st | Δ ld/st | base insns | head insns | Δ insns |
|---|---|---|---|---|---|---|
| compress_block | 43 | 32 | -25.6% | 130 | 119 | -8.5% |
| send_tree | 96 | 37 | -61.5% | 298 | 240 | -19.5% |
| zng_tr_flush_bits | 71 | 39 | -45.1% | | | |

send_tree lost 59 load/store instructions (~61% reduction) because each emit in its tight loop now operates on register-resident e.bi_buf / e.bi_valid instead of reloading from s every call.
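A simplified sketch of the two loop shapes may help here. It is not zlib-ng's `send_bits`: `bit_state`, both `emit_codes_*` functions, and the byte-at-a-time flush are assumptions made for illustration. One plausible reason the baseline reloads state is aliasing: the byte store through `pending_buf` could in principle overwrite fields of `*s`, so the compiler has to re-read them, whereas a local whose address does not escape can stay in registers.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the bit-emit portion of deflate_state. */
typedef struct {
    uint64_t bi_buf;
    uint32_t bi_valid;
    uint8_t *pending_buf;
    uint32_t pending;
} bit_state;

/* Simplified LSB-first emit: append `length` bits, flush whole bytes.
 * Assumes length <= 16 and fewer than 8 bits pending on entry. */
static inline void send_bits(bit_state *b, uint32_t value, uint32_t length) {
    b->bi_buf |= (uint64_t)value << b->bi_valid;
    b->bi_valid += length;
    while (b->bi_valid >= 8) {
        b->pending_buf[b->pending++] = (uint8_t)b->bi_buf;
        b->bi_buf >>= 8;
        b->bi_valid -= 8;
    }
}

/* Baseline shape: every emit reads and writes the state through s. */
static void emit_codes_through_state(bit_state *s, const uint16_t *codes,
                                     const uint8_t *lens, size_t n) {
    for (size_t i = 0; i < n; i++)
        send_bits(s, codes[i], lens[i]);
}

/* Hoisted shape: copy the state once, emit on the local, write back once. */
static void emit_codes_hoisted(bit_state *s, const uint16_t *codes,
                               const uint8_t *lens, size_t n) {
    bit_state e = *s;
    for (size_t i = 0; i < n; i++)
        send_bits(&e, codes[i], lens[i]);
    *s = e;
}
```

Both functions produce the same bytes and the same final state for the same input. Only the number of memory accesses inside the loop differs, and that is the quantity the ld/st columns count.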

compress_block gained 2 stack slots for the e local (4 → 6 spills) but those are setup, not in the loop. Inside the loop, ld/st count is down 25%.

Whole-file size deltas

| Object | base lines | head lines | Δ |
|---|---|---|---|
| trees.o | 1930 | 1752 | -178 (-9.2%) |
| deflate_quick.o | 457 | 411 | -46 (-10.1%) |
| deflate.o | 2338 | 2336 | -2 |
| deflate_stored.o | 360 | 358 | -2 |

deflate_quick.o shrank ~10% even though deflate_quick.c itself is unchanged in this PR — the win comes from zng_tr_emit_tree / zng_tr_emit_end_block (called from quick_start_block / quick_end_block) now inlining the hoisted form.

Macro → static inline conversion

Verified codegen-equivalent: switching put_* macros to static inline deflate_emit_hot * functions produces only commutative orr operand swaps in trees.o and one 32-bit/64-bit neg form swap in deflate_quick.o. No instruction count, register pressure, or scheduling changes that affect the critical path. The conversion is a clarity/type-safety win, not a perf change.

Benchmark results

benchmark_zlib filtered to deflate_bench/deflate_level/{131072,1048576}/{1,3,6,9}, 5 repetitions with --benchmark_cooldown=5, aggregates only. CVs 0.46-1.5%.

| Strategy | Level | Size | Δ wall | Δ CPU | Δ ratio |
|---|---|---|---|---|---|
| deflate_quick | 1 | 128K | -1.18% | -1.01% | -0.01% |
| deflate_quick | 1 | 1M | -2.03% | -2.04% | 0.00% |
| deflate_fast | 3 | 128K | -0.07% | +0.02% | -0.01% |
| deflate_fast | 3 | 1M | -0.28% | -0.28% | 0.00% |
| deflate_medium | 6 | 128K | +1.86% | +1.68% | 0.00% |
| deflate_medium | 6 | 1M | +1.20% | +1.32% | -0.01% |
| deflate_slow | 9 | 128K | -0.84% | -0.87% | -0.01% |
| deflate_slow | 9 | 1M | -1.19% | -1.18% | 0.00% |
| Geomean | | | -0.32% | | -0.01% |
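The -0.32% geomean figure can be reproduced from the eight wall-time deltas, assuming it is the geometric mean of the head/base time ratios (1 + Δ). The sketch below is an illustration rather than the script that produced the table. The function names are invented, and the 8th root is taken as three successive Newton square roots so the code does not depend on libm.

```c
#include <stddef.h>

/* Square root by Newton's method (x > 0), to avoid a libm dependency. */
static double newton_sqrt(double x) {
    double r = x > 1.0 ? x : 1.0;
    for (int i = 0; i < 60; i++)
        r = 0.5 * (r + x / r);
    return r;
}

/* Geometric mean of eight (1 + delta) ratios, returned as a percent delta.
 * Fixed at n == 8 because the 8th root is computed as sqrt(sqrt(sqrt(p))). */
static double geomean_delta_pct8(const double *delta_pct) {
    double product = 1.0;
    for (size_t i = 0; i < 8; i++)
        product *= 1.0 + delta_pct[i] / 100.0;
    double root8 = newton_sqrt(newton_sqrt(newton_sqrt(product)));
    return (root8 - 1.0) * 100.0;
}

/* Wall-time deltas from the table, in row order. */
static const double wall_delta_pct[8] = {
    -1.18, -2.03, -0.07, -0.28, 1.86, 1.20, -0.84, -1.19
};
```

`geomean_delta_pct8(wall_delta_pct)` rounds to the -0.32% in the table. At deltas this small the geometric and arithmetic means agree to two decimal places.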

Notable patterns:

  • Level 1 (quick) wins 1-2% even though deflate_quick.c is unchanged in this PR. The gain comes from zng_tr_emit_tree / zng_tr_emit_end_block inlining the hoisted form into quick_start_block / quick_end_block.
  • Level 9 (slow) wins ~1%.
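  • Level 6 (medium) regresses 1.2-1.9%, the only consistent regression. The 128K delta (+1.86% wall) exceeds the 0.46-1.5% CV range and the 1M delta (+1.20% wall) sits near its upper end; both are small enough to be code-layout or register-allocation noise from the inline conversion.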

Compression ratio is essentially identical across the board (±1 byte = gzip-header timestamp jitter).

Why the asm wins don't translate to bench wins

  1. Bit-emit is a small fraction of compress_block total time. Most cycles go to Huffman lookup, hash insertion, and match-length compute — saving 11 of 130 instructions in the function is ~8% of static count but a much smaller fraction of dynamic critical-path latency.
  2. The eliminated stores were going to the L1 cache / store buffer, which an out-of-order core hides cheaply.
  3. send_tree runs once per Huffman block (~one per 16-32 KB of input on level 6), so its 58-instruction reduction (298 → 240) is spread thin across overall throughput.
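Point 1 is an Amdahl-style argument, and a back-of-the-envelope helper makes the scale concrete. This is only a sketch: `total_time_ratio` is an invented name, and the inputs in the note below it are hypothetical, not measurements from this benchmark.

```c
/* Ratio of new to old total time when a fraction f of the old time
 * is scaled by factor k and the remaining (1 - f) is unchanged. */
static double total_time_ratio(double f, double k) {
    return (1.0 - f) + f * k;
}
```

As a purely hypothetical example, if bit-emit accounted for 5% of runtime and the change halved its cost, `total_time_ratio(0.05, 0.5)` gives 0.975, a 2.5% improvement overall. Smaller shares or smaller local speedups shrink the result proportionally.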

Summary

  • Real codegen win on bit-emit hot paths (-25% to -61% ld/st in key functions).
  • Whole-file .o size down ~9-10% for trees.o and deflate_quick.o.
  • Levels 1, 3, 9: flat to ~2% wall-time improvement.
  • Level 6 (deflate_medium): consistent ~1.2-1.9% regression worth investigating.
  • Compression ratio unchanged across all levels.
  • Bonus: explicit type for bit-emit state, type-safe inline put_* helpers.