Validity-check benchmark for the `zng_check_lens(lens, codes)`
function proposed in the PR #2267 discussion. All three variants
scan `lens[0..codes-1]` and return -1 if any entry exceeds `MAX_BITS`
(15). The input is all-valid (random values in [0, 15]), so the
benchmark measures the worst case: a full scan with no early exit.
Variants:
- SIMD: SSE2 / NEON / AltiVec intrinsics, 8 uint16 per iter, `_mm_subs_epu16` (or `vcgtq_u16`, `vec_cmpgt`) + OR accumulator, scalar tail for 0-7 remaining entries.
- SWAR: `zng_memread_8` + OR accumulator, 4 uint16 per iter, final AND with `0xFFF0FFF0FFF0FFF0` checks for any bit above bit 3.
- Scalar: per-entry `if (lens[i] > MAX_BITS) return -1;`.
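
For reference, the scalar and SWAR variants described above can be sketched roughly as follows. This is not the code from the PR or the gist: the function names are hypothetical, `memcpy` stands in for `zng_memread_8`, and the success return value of 0 is an assumption (the text only specifies -1 on failure).

```c
#include <stdint.h>
#include <string.h>

#define MAX_BITS 15

/* Scalar reference: early exit on the first out-of-range entry. */
static int check_lens_scalar(const uint16_t *lens, unsigned codes) {
    for (unsigned i = 0; i < codes; i++) {
        if (lens[i] > MAX_BITS)
            return -1;
    }
    return 0; /* assumed success value */
}

/* SWAR: OR four uint16 entries at a time into one 64-bit accumulator,
 * then test all lanes at once with a single mask. */
static int check_lens_swar(const uint16_t *lens, unsigned codes) {
    uint64_t bad = 0;
    unsigned i = 0;

    for (; i + 4 <= codes; i += 4) {
        uint64_t chunk;
        memcpy(&chunk, &lens[i], sizeof(chunk)); /* stand-in for zng_memread_8 */
        bad |= chunk;
    }
    /* Scalar tail for the 0-3 remaining entries. */
    for (; i < codes; i++)
        bad |= lens[i];

    /* Any bit above bit 3 in any 16-bit lane means some entry was > 15. */
    return (bad & UINT64_C(0xFFF0FFF0FFF0FFF0)) ? -1 : 0;
}
```

The mask is identical in every 16-bit lane, so the test does not depend on byte order. Unlike the scalar loop, the SWAR version never exits early, which costs nothing on the all-valid input measured here.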
The benchmark harness is in `zlib-ng-check-lens-bench.cc`: drop it into
`test/benchmarks/` and add it to `BENCH_INTERNAL_SRCS` in the
benchmarks `CMakeLists.txt`. Note the `benchmark::DoNotOptimize(lens)`
inside the measurement loop. Without it, clang hoists the entire
SIMD/SWAR scan out of the loop (the input never changes and the final
branch is a pure function of that input) and reports a constant
~0.45 ns at every size.
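
The hoisting problem can be illustrated in plain C with a compiler barrier similar in spirit to `benchmark::DoNotOptimize`. This is a sketch only: `clobber_input`, `scan` and `run_scan` are hypothetical names, the empty inline-asm barrier is GCC/clang-specific, and the real harness uses Google Benchmark rather than a hand-rolled loop.

```c
#include <stdint.h>

/* GCC/clang-only barrier: the compiler must assume the memory reachable
 * through p may have been modified, so loads from it cannot be hoisted. */
static inline void clobber_input(const void *p) {
    __asm__ volatile("" : : "r"(p) : "memory");
}

/* Hypothetical scan under test: OR everything, one final mask test. */
static int scan(const uint16_t *lens, unsigned codes) {
    unsigned bad = 0;
    for (unsigned i = 0; i < codes; i++)
        bad |= lens[i];
    return (bad & 0xFFF0u) ? -1 : 0;
}

/* Runs the scan reps times and returns how many runs reported failure. */
static unsigned run_scan(const uint16_t *lens, unsigned codes, unsigned reps) {
    unsigned failures = 0;
    for (unsigned r = 0; r < reps; r++) {
        clobber_input(lens); /* forces the scan to be redone every repetition */
        failures += (scan(lens, codes) != 0);
    }
    return failures;
}
```

Without the `clobber_input` call, an optimizer is free to compute `scan(lens, codes)` once before the loop, since nothing inside the loop changes the input.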
Apple Silicon (M-series), clang, Release build, BUILD_SHARED_LIBS=OFF,
WITH_MAINTAINER_WARNINGS=ON, 20 repetitions, 0.5s min time,
2s cooldown. Median CPU time.
| codes | SIMD (NEON) | SWAR | Scalar | SWAR vs SIMD |
|---|---|---|---|---|
| 19 | 1.81 ns | 2.72 ns | 9.81 ns | +50% |
| 30 | 1.81 ns | 1.81 ns | 15.3 ns | tie |
| 286 | 15.4 ns | 6.47 ns | 145 ns | −58% |
CV was under 3% on every row.
- SWAR crushes scalar at every size (3.6× to 22× faster). Even the worst-case small-count SWAR is faster than scalar.
- SIMD wins at small counts. At codes=19 SIMD takes about 33% less time (1.81 ns vs 2.72 ns) because it covers the first 16 entries in two 8-wide iterations, while SWAR's 4-wide inner loop needs four; both then finish with the same 3-entry scalar tail.
- SWAR wins at large counts. At codes=286 SWAR is 2.4× faster. The SWAR inner loop has no per-iteration compare, just `bad |= load` with a single final mask test, so clang autovectorizes it into NEON with better throughput than the hand-written intrinsics, which force an explicit `vcgtq_u16` every iteration.
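
The loop shapes behind these numbers follow from simple division: full iterations are `codes / lanes` and the scalar tail is `codes % lanes`. A small sketch, with hypothetical helper names, for the three benchmarked sizes:

```c
/* Full iterations and scalar-tail length for a loop that handles
 * `lanes` uint16 entries per iteration (8 for SIMD, 4 for SWAR). */
static unsigned full_iters(unsigned codes, unsigned lanes) {
    return codes / lanes;
}

static unsigned tail_len(unsigned codes, unsigned lanes) {
    return codes % lanes;
}

/*
 * codes | SIMD (8 lanes)    | SWAR (4 lanes)
 * ------+-------------------+------------------
 *    19 |  2 iters + 3 tail |  4 iters + 3 tail
 *    30 |  3 iters + 6 tail |  7 iters + 2 tail
 *   286 | 35 iters + 6 tail | 71 iters + 2 tail
 */
```

These counts describe the source-level loops only; once clang autovectorizes the SWAR loop, the instructions actually executed per entry can differ.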
Unlike the count_lengths case (where SWAR was 25-30× slower at small
sizes due to the 16-iteration horizontal tail), SWAR for check_lens
is competitive everywhere and wins outright at codes=286.
For the PR #2267 fix, SWAR is the better choice:
- one function covers every architecture, no `#if defined(__SSE2__) / ...` fan-out
- strictly faster than the scalar fallback at every measured size
- about 1 ns slower than hand-written SIMD at codes=19, tied at codes=30, and 2.4× faster at codes=286
- the whole thing is ~10 lines
The SIMD intrinsics variant remains on record in the original gist https://gist.github.com/nmoinvaz/47889c1f1efcda7594281744aca957c5 for anyone who needs the best-possible small-count latency.