Branch: nv/develop/deflate-struct-hoist
Commit: c435d01e — "Hoist deflate bit-emit state into deflate_emit_hot local"
Base: upstream/develop (48087450)
Introduces a deflate_emit_hot struct (bi_buf, bi_valid, pending_buf, pending) and DEFLATE_EMIT_HOT_LOAD/STORE macros that cache bit-emit state in a local at the top of hot loops and write it back once on exit. Converts put_byte/put_short/put_short_msb/put_uint32/put_uint32_msb/put_uint64 from macros to static inline functions taking a deflate_emit_hot *. Cold-path callers in deflate.c (zlib/gzip header and trailer writers, name/comment loops) and deflate_stored.c bracket their put_* clusters with LOAD/STORE.
Net diff: 5 files, +233 / -190.
- Apple M5 (10 physical / 10 logical cores), 32 GiB RAM
- macOS 26.4.1 (build 25E253)
- Apple clang 21.0.0
- CMake Release,
-D BUILD_SHARED_LIBS=OFF
Comparing .o output of upstream/develop (base) vs HEAD.
| Function | base ld/st | head ld/st | Δ ld/st | base insns | head insns | Δ insns |
|---|---|---|---|---|---|---|
compress_block |
43 | 32 | -25.6% | 130 | 119 | -8.5% |
send_tree |
96 | 37 | -61.5% | 298 | 240 | -19.5% |
zng_tr_flush_bits |
— | — | — | 71 | 39 | -45.1% |
send_tree lost 59 load/store instructions (~61% reduction) because each emit in its tight loop now operates on register-resident e.bi_buf / e.bi_valid instead of reloading from s every call.
compress_block gained 2 stack slots for the e local (4 → 6 spills) but those are setup, not in the loop. Inside the loop, ld/st count is down 25%.
| Object | base lines | head lines | Δ |
|---|---|---|---|
trees.o |
1930 | 1752 | -178 (-9.2%) |
deflate_quick.o |
457 | 411 | -46 (-10.1%) |
deflate.o |
2338 | 2336 | -2 |
deflate_stored.o |
360 | 358 | -2 |
deflate_quick.o shrank ~10% even though deflate_quick.c itself is unchanged in this PR — the win comes from zng_tr_emit_tree / zng_tr_emit_end_block (called from quick_start_block / quick_end_block) now inlining the hoisted form.
Verified codegen-equivalent: switching put_* macros to static inline deflate_emit_hot * functions produces only commutative orr operand swaps in trees.o and one 32-bit/64-bit neg form swap in deflate_quick.o. No instruction count, register pressure, or scheduling changes that affect the critical path. The conversion is a clarity/type-safety win, not a perf change.
benchmark_zlib filtered to deflate_bench/deflate_level/{131072,1048576}/{1,3,6,9}, 5 repetitions with --benchmark_cooldown=5, aggregates only. CVs 0.46-1.5%.
| Strategy | Level | Size | Δ wall | Δ CPU | Δ ratio |
|---|---|---|---|---|---|
deflate_quick |
1 | 128K | -1.18% | -1.01% | -0.01% |
deflate_quick |
1 | 1M | -2.03% | -2.04% | 0.00% |
deflate_fast |
3 | 128K | -0.07% | +0.02% | -0.01% |
deflate_fast |
3 | 1M | -0.28% | -0.28% | 0.00% |
deflate_medium |
6 | 128K | +1.86% | +1.68% | 0.00% |
deflate_medium |
6 | 1M | +1.20% | +1.32% | -0.01% |
deflate_slow |
9 | 128K | -0.84% | -0.87% | -0.01% |
deflate_slow |
9 | 1M | -1.19% | -1.18% | 0.00% |
| Geomean | -0.32% | -0.01% |
Notable patterns:
- Level 1 (quick) wins 1-2% even though
deflate_quick.cis unchanged in this PR. The gain comes fromzng_tr_emit_tree/zng_tr_emit_end_blockinlining the hoisted form intoquick_start_block/quick_end_block. - Level 9 (slow) wins ~1%.
- Level 6 (medium) regresses 1.2-1.9% — the only consistent regression. Outside typical CV but small enough to be code-layout or register-allocation noise from the inline conversion.
Compression ratio is essentially identical across the board (±1 byte = gzip-header timestamp jitter).
- Bit-emit is a small fraction of
compress_blocktotal time. Most cycles go to Huffman lookup, hash insertion, and match-length compute — saving 11 of 130 instructions in the function is ~8% of static count but a much smaller fraction of dynamic critical-path latency. - The eliminated stores were going to the L1 cache / store buffer, which an out-of-order core hides cheaply.
send_treeruns once per Huffman block (~one per 16-32 KB of input on level 6), so its -58 instruction win amortizes thin across overall throughput.
- Real codegen win on bit-emit hot paths (-25% to -61% ld/st in key functions).
- Whole-file
.osize down ~9-10% fortrees.oanddeflate_quick.o. - Levels 1, 3, 9: flat to ~2% wall-time improvement.
- Level 6 (
deflate_medium): consistent ~1.2-1.9% regression worth investigating. - Compression ratio unchanged across all levels.
- Bonus: explicit type for bit-emit state, type-safe inline put_* helpers.