zlib-ng: deflate-struct-hoist — codegen and benchmark analysis

Branch: nv/develop/deflate-struct-hoist
Commit: c435d01e — "Hoist deflate bit-emit state into deflate_emit_hot local"
Base: upstream/develop (48087450)

Change

Introduces a deflate_emit_hot struct (bi_buf, bi_valid, pending_buf, pending) and DEFLATE_EMIT_HOT_LOAD/STORE macros that cache bit-emit state in a local at the top of hot loops and write it back once on exit. Converts put_byte/put_short/put_short_msb/put_uint32/put_uint32_msb/put_uint64 from macros to static inline functions taking a deflate_emit_hot *. Cold-path callers in deflate.c (zlib/gzip header and trailer writers, name/comment loops) and deflate_stored.c bracket their put_* clusters with LOAD/STORE.
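To make the shape of the change concrete, here is a minimal sketch of the pattern described above. Only the names deflate_emit_hot, DEFLATE_EMIT_HOT_LOAD/STORE, put_byte, put_short and the four field names come from this write-up. The field types, the macro argument order, the toy_state stand-in for the deflate state, and write_gzip_magic are assumptions made for illustration and may not match the actual commit.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the deflate state; only the four cached fields. */
typedef struct {
    uint64_t bi_buf;       /* bit accumulator */
    int32_t  bi_valid;     /* number of valid bits in bi_buf */
    uint8_t *pending_buf;  /* output buffer */
    uint32_t pending;      /* bytes already written to pending_buf */
} toy_state;

/* Local copy of the bit-emit state (field types are assumed). */
typedef struct {
    uint64_t bi_buf;
    int32_t  bi_valid;
    uint8_t *pending_buf;
    uint32_t pending;
} deflate_emit_hot;

/* Argument order is an assumption; the write-up only names the macros. */
#define DEFLATE_EMIT_HOT_LOAD(e, s) do { \
    (e).bi_buf      = (s)->bi_buf;       \
    (e).bi_valid    = (s)->bi_valid;     \
    (e).pending_buf = (s)->pending_buf;  \
    (e).pending     = (s)->pending;      \
} while (0)

#define DEFLATE_EMIT_HOT_STORE(s, e) do { \
    (s)->bi_buf   = (e).bi_buf;           \
    (s)->bi_valid = (e).bi_valid;         \
    (s)->pending  = (e).pending;          \
} while (0)

/* Typed helpers replacing the old macros: they take the local, not the state. */
static inline void put_byte(deflate_emit_hot *e, uint8_t c) {
    e->pending_buf[e->pending++] = c;
}

static inline void put_short(deflate_emit_hot *e, uint16_t w) {
    put_byte(e, (uint8_t)(w & 0xff));  /* least significant byte first */
    put_byte(e, (uint8_t)(w >> 8));
}

/* Hypothetical cold-path caller: brackets a put_* cluster with LOAD/STORE. */
static void write_gzip_magic(toy_state *s) {
    deflate_emit_hot e;
    assert(s->pending_buf != 0);
    DEFLATE_EMIT_HOT_LOAD(e, s);
    put_byte(&e, 0x1f);
    put_byte(&e, 0x8b);
    put_short(&e, 0x0008);  /* CM = 8 (deflate), FLG = 0 */
    DEFLATE_EMIT_HOT_STORE(s, e);
}
```

Passing a `deflate_emit_hot *` instead of expanding a macro against `s` means the compiler type-checks the argument and a caller that forgets the LOAD fails to compile instead of silently writing through stale state.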

Net diff: 5 files, +233 / -190.

Machine

Apple M5 (10 physical / 10 logical cores), 32 GiB RAM
macOS 26.4.1 (build 25E253)
Apple clang 21.0.0
CMake Release, -D BUILD_SHARED_LIBS=OFF

Codegen impact (per-function, AArch64)

Comparing .o output of upstream/develop (base) vs HEAD.

Hot functions in `trees.c`

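| Function | base ld/st | head ld/st | Δ ld/st | base insns | head insns | Δ insns |
|---|---|---|---|---|---|---|
| `compress_block` | 43 | 32 | -25.6% | 130 | 119 | -8.5% |
| `send_tree` | 96 | 37 | -61.5% | 298 | 240 | -19.5% |
| `zng_tr_flush_bits` | n/a | n/a | n/a | 71 | 39 | -45.1% |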

send_tree lost 59 load/store instructions (~61% reduction) because each emit in its tight loop now operates on register-resident e.bi_buf / e.bi_valid instead of reloading from s every call.

compress_block gained 2 stack slots for the e local (4 → 6 spills) but those are setup, not in the loop. Inside the loop, ld/st count is down 25%.
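The load/store reduction follows from where the bit state lives. A simplified sketch of the two forms, assuming a byte-at-a-time flush (the real emit path flushes wider words) and using a hypothetical toy_state in place of the deflate state; emit_codes_per_call and emit_codes_hoisted are illustrative names, not functions from the branch:

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint64_t bi_buf;
    int32_t  bi_valid;
    uint8_t *pending_buf;
    uint32_t pending;
} toy_state;  /* hypothetical stand-in for the deflate state */

typedef struct {
    uint64_t bi_buf;
    int32_t  bi_valid;
    uint8_t *pending_buf;
    uint32_t pending;
} deflate_emit_hot;

/* Per-call form: bit state is read and written through s on every emit.
 * The byte store into pending_buf may alias *s, so the compiler generally
 * has to reload s->bi_buf and s->bi_valid after it. */
static void send_bits_state(toy_state *s, uint32_t value, int length) {
    assert(length > 0 && length <= 16);
    s->bi_buf |= (uint64_t)value << s->bi_valid;
    s->bi_valid += length;
    while (s->bi_valid >= 8) {
        s->pending_buf[s->pending++] = (uint8_t)s->bi_buf;
        s->bi_buf >>= 8;
        s->bi_valid -= 8;
    }
}

/* Hoisted form: identical logic on a local. Once inlined, the local's
 * address does not escape, so its fields can stay in registers. */
static inline void send_bits_hot(deflate_emit_hot *e, uint32_t value, int length) {
    assert(length > 0 && length <= 16);
    e->bi_buf |= (uint64_t)value << e->bi_valid;
    e->bi_valid += length;
    while (e->bi_valid >= 8) {
        e->pending_buf[e->pending++] = (uint8_t)e->bi_buf;
        e->bi_buf >>= 8;
        e->bi_valid -= 8;
    }
}

static void emit_codes_per_call(toy_state *s, const uint16_t *codes,
                                const uint8_t *lens, int n) {
    for (int i = 0; i < n; i++)
        send_bits_state(s, codes[i], lens[i]);
}

static void emit_codes_hoisted(toy_state *s, const uint16_t *codes,
                               const uint8_t *lens, int n) {
    /* Load once at the top of the loop. */
    deflate_emit_hot e = { s->bi_buf, s->bi_valid, s->pending_buf, s->pending };
    for (int i = 0; i < n; i++)
        send_bits_hot(&e, codes[i], lens[i]);
    /* Write back once on exit. */
    s->bi_buf = e.bi_buf;
    s->bi_valid = e.bi_valid;
    s->pending = e.pending;
}
```

Both forms produce the same bytes and the same final bit state. The only difference is how often the state round-trips through memory, which is what the ld/st columns above count.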

Whole-file size deltas

| Object | base lines | head lines | Δ |
|---|---|---|---|
| `trees.o` | 1930 | 1752 | -178 (-9.2%) |
| `deflate_quick.o` | 457 | 411 | -46 (-10.1%) |
| `deflate.o` | 2338 | 2336 | -2 |
| `deflate_stored.o` | 360 | 358 | -2 |

deflate_quick.o shrank ~10% even though deflate_quick.c itself is unchanged in this PR — the win comes from zng_tr_emit_tree / zng_tr_emit_end_block (called from quick_start_block / quick_end_block) now inlining the hoisted form.

Macro → static inline conversion

Verified codegen-equivalent: switching put_* macros to static inline deflate_emit_hot * functions produces only commutative orr operand swaps in trees.o and one 32-bit/64-bit neg form swap in deflate_quick.o. No instruction count, register pressure, or scheduling changes that affect the critical path. The conversion is a clarity/type-safety win, not a perf change.

Benchmark results

benchmark_zlib filtered to deflate_bench/deflate_level/{131072,1048576}/{1,3,6,9}, 5 repetitions with --benchmark_cooldown=5, aggregates only. CVs 0.46-1.5%.

| Strategy | Level | Size | Δ wall | Δ CPU | Δ ratio |
|---|---|---|---|---|---|
| `deflate_quick` | 1 | 128K | -1.18% | -1.01% | -0.01% |
| `deflate_quick` | 1 | 1M | -2.03% | -2.04% | 0.00% |
| `deflate_fast` | 3 | 128K | -0.07% | +0.02% | -0.01% |
| `deflate_fast` | 3 | 1M | -0.28% | -0.28% | 0.00% |
| `deflate_medium` | 6 | 128K | +1.86% | +1.68% | 0.00% |
| `deflate_medium` | 6 | 1M | +1.20% | +1.32% | -0.01% |
| `deflate_slow` | 9 | 128K | -0.84% | -0.87% | -0.01% |
| `deflate_slow` | 9 | 1M | -1.19% | -1.18% | 0.00% |
| Geomean | | | -0.32% | | -0.01% |
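The Geomean row can be reproduced from the eight wall-time deltas. This sketch assumes the row is the geometric mean of the head/base time ratios (1 + Δ/100), which the write-up does not state explicitly. nth_root and geomean_delta_pct are illustrative helpers, and the root is computed with a Newton iteration so the snippet builds without linking libm.

```c
#include <assert.h>
#include <stddef.h>

/* n-th root of a > 0 by Newton's method on f(x) = x^n - a. */
static double nth_root(double a, unsigned n) {
    assert(a > 0.0 && n > 0);
    double x = 1.0;  /* the ratios are close to 1.0, so start there */
    for (int iter = 0; iter < 64; iter++) {
        double p = 1.0;
        for (unsigned i = 1; i < n; i++)
            p *= x;  /* p = x^(n-1) */
        x = ((double)(n - 1) * x + a / p) / (double)n;
    }
    return x;
}

/* Geometric mean of percentage deltas, returned as a percentage delta. */
static double geomean_delta_pct(const double *delta_pct, size_t n) {
    assert(n > 0);
    double product = 1.0;
    for (size_t i = 0; i < n; i++)
        product *= 1.0 + delta_pct[i] / 100.0;  /* delta -> head/base ratio */
    return (nth_root(product, (unsigned)n) - 1.0) * 100.0;
}

/* Wall-time deltas copied from the table above, in row order. */
static const double wall_delta_pct[8] = {
    -1.18, -2.03, -0.07, -0.28, 1.86, 1.20, -0.84, -1.19
};
```

Calling `geomean_delta_pct(wall_delta_pct, 8)` gives a value that rounds to -0.32%. With deltas this small the arithmetic mean rounds to the same figure, so the table does not distinguish between the two definitions.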

Notable patterns:

Level 1 (quick) wins 1-2% even though deflate_quick.c is unchanged in this PR. The gain comes from zng_tr_emit_tree / zng_tr_emit_end_block inlining the hoisted form into quick_start_block / quick_end_block.
Level 9 (slow) wins ~1%.
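Level 6 (medium) regresses 1.2-1.9%, the only consistent regression. The 128K delta (+1.86%) exceeds the largest measured CV (1.5%) while the 1M delta (+1.20%) falls within it, so the effect is small enough to be code-layout or register-allocation noise from the inline conversion.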

Compression ratio is essentially identical across the board (±1 byte = gzip-header timestamp jitter).

Why the asm wins don't translate to bench wins

Bit-emit is a small fraction of compress_block total time. Most cycles go to Huffman lookup, hash insertion, and match-length compute — saving 11 of 130 instructions in the function is ~8% of static count but a much smaller fraction of dynamic critical-path latency.
The eliminated stores were going to the L1 cache / store buffer, which an out-of-order core hides cheaply.
send_tree runs once per Huffman block (~one per 16-32 KB of input on level 6), so its 58-instruction saving is spread thinly across overall throughput.
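The first point is an instance of Amdahl's law: the overall gain is bounded by the share of run time the optimized code accounts for. A small sketch follows; the write-up does not measure the bit-emit share of run time, so any inputs passed to this helper are hypothetical, and amdahl_speedup is an illustrative name.

```c
#include <assert.h>

/* Amdahl's law: overall speedup when a fraction of total run time
 * (0..1) is made local_speedup times faster and the rest is unchanged. */
static double amdahl_speedup(double fraction, double local_speedup) {
    assert(fraction >= 0.0 && fraction <= 1.0);
    assert(local_speedup > 0.0);
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup);
}
```

As an illustration only: if bit-emit were 5% of run time and became 1.25x faster, `amdahl_speedup(0.05, 1.25)` is about 1.0101, roughly a 1% overall gain. That is the same order of magnitude as the level 1 and level 9 results, though the 5% and 1.25x figures are assumed, not measured.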

Summary

Real codegen win on bit-emit hot paths (-25% to -61% ld/st in key functions).
Whole-file .o size down ~9-10% for trees.o and deflate_quick.o.
Levels 1, 3, 9: flat to ~2% wall-time improvement.
Level 6 (deflate_medium): consistent ~1.2-1.9% regression worth investigating.
Compression ratio unchanged across all levels.
Bonus: explicit type for bit-emit state, type-safe inline put_* helpers.

nmoinvaz/zlib-ng-deflate-struct-hoist-benefits.md
