Skip to content

Instantly share code, notes, and snippets.

@nmoinvaz
Created April 23, 2026 00:58
Show Gist options
  • Select an option

  • Save nmoinvaz/41f7368d579fee5f33d4c5b21a9e7ff9 to your computer and use it in GitHub Desktop.

Select an option

Save nmoinvaz/41f7368d579fee5f33d4c5b21a9e7ff9 to your computer and use it in GitHub Desktop.
zlib-ng zng_check_lens: SIMD vs SWAR vs scalar benchmark (spun out of PR #2267)

zng_check_lens: SIMD vs SWAR vs scalar

Validity-check benchmark for the zng_check_lens(lens, codes) function proposed in the PR #2267 discussion. All three variants scan lens[0..codes-1] and return -1 if any entry exceeds MAX_BITS (15). Input is all-valid (random values in [0, 15]) so the worst case — a full scan with no early exit — is measured.

Variants:

  • SIMD: SSE2 / NEON / AltiVec intrinsics, 8 uint16 per iter, _mm_subs_epu16 (or vcgtq_u16, vec_cmpgt) + OR accumulator, scalar tail for 0-7 remaining entries.
  • SWAR: zng_memread_8 + OR accumulator, 4 uint16 per iter, final AND 0xFFF0FFF0FFF0FFF0 checks any bit above bit 3.
  • Scalar: per-entry if (lens[i] > MAX_BITS) return -1;.

Benchmark harness in zlib-ng-check-lens-bench.cc — drop into test/benchmarks/ and add to BENCH_INTERNAL_SRCS in the benchmarks CMakeLists.txt. Note the benchmark::DoNotOptimize(lens) inside the measurement loop — without it, clang hoists the entire SIMD/SWAR scan out (the input never changes and the final branch is data-flow-pure) and reports constant ~0.45 ns at every size.

Results

Apple Silicon (M-series), clang, Release build, BUILD_SHARED_LIBS=OFF, WITH_MAINTAINER_WARNINGS=ON, 20 repetitions, 0.5s min time, 2s cooldown. Median CPU time.

codes SIMD (NEON) SWAR Scalar SWAR vs SIMD
19 1.81 ns 2.72 ns 9.81 ns +50%
30 1.81 ns 1.81 ns 15.3 ns tie
286 15.4 ns 6.47 ns 145 ns −58%

CV was under 3% on every row.

Reading the numbers

  • SWAR crushes scalar at every size (3.6× to 22× faster). Even the worst-case small-count SWAR is faster than scalar.

  • SIMD wins at small counts. At codes=19 SIMD is 33% faster because its fixed overhead (compare + OR per iter) amortizes over a single SIMD iteration, while SWAR's 4-wide inner loop needs four iterations plus a 3-entry scalar tail.

  • SWAR wins at large counts. At codes=286 SWAR is 2.4× faster. The SWAR inner loop has no per-iteration compare — just bad |= load with a single final mask test — so clang autovectorizes it into NEON with better throughput than the hand-written intrinsics, which force an explicit vcgtq_u16 every iteration.

Takeaway

Unlike the count_lengths case (where SWAR was 25-30× slower at small sizes due to the 16-iteration horizontal tail), SWAR for check_lens is competitive everywhere and wins outright at codes=286.

For the PR #2267 fix, SWAR is the better choice:

  • one function covers every architecture, no #if defined(__SSE2__) / ... fan-out
  • strictly faster than the scalar fallback, everywhere
  • only 1-2 ns slower than hand-written SIMD at the smallest sizes, and substantially faster at the largest one
  • the whole thing is ~10 lines

The SIMD intrinsics variant remains on record in the original gist https://gist.github.com/nmoinvaz/47889c1f1efcda7594281744aca957c5 for anyone who needs the best-possible small-count latency.

/* benchmark_check_lens.cc -- SIMD vs SWAR vs scalar validity check
* for a Huffman code-length buffer (see zlib-ng issue #2266, PR #2267).
*
* All variants scan lens[0..codes-1] and return -1 if any entry
* exceeds MAX_BITS. Input is all-valid so the worst case (full scan,
* no early exit) is measured.
*/
#include <stdio.h>
#include <stdlib.h>
#include <benchmark/benchmark.h>
extern "C" {
# include "zbuild.h"
# include "zutil.h"
# include "zmemory.h"
}
#if defined(__SSE2__)
# include "arch/x86/x86_intrins.h"
#elif defined(__ARM_NEON) || defined(__ARM_NEON__)
# include "arch/arm/neon_intrins.h"
#elif defined(__ALTIVEC__)
# include "arch/power/power_intrins.h"
#endif
static inline int check_lens_simd(const uint16_t *lens, unsigned codes) {
unsigned i = 0;
#if defined(__SSE2__)
__m128i max = _mm_set1_epi16(MAX_BITS);
__m128i bad = _mm_setzero_si128();
for (; i + 8 <= codes; i += 8) {
__m128i v = _mm_loadu_si128((const __m128i *)&lens[i]);
bad = _mm_or_si128(bad, _mm_subs_epu16(v, max));
}
if (_mm_movemask_epi8(_mm_cmpeq_epi8(bad, _mm_setzero_si128())) != 0xFFFF)
return -1;
#elif defined(__ARM_NEON) || defined(__ARM_NEON__)
uint16x8_t max = vdupq_n_u16(MAX_BITS);
uint16x8_t bad = vdupq_n_u16(0);
for (; i + 8 <= codes; i += 8) {
uint16x8_t v = vld1q_u16(&lens[i]);
bad = vorrq_u16(bad, vcgtq_u16(v, max));
}
# if defined(__aarch64__)
if (vmaxvq_u16(bad) != 0) return -1;
# else
{
uint16x4_t r = vorr_u16(vget_low_u16(bad), vget_high_u16(bad));
uint64_t s;
vst1_u64(&s, vreinterpret_u64_u16(r));
if (s) return -1;
}
# endif
#elif defined(__ALTIVEC__)
vector unsigned short max = vec_splats((uint16_t)MAX_BITS);
vector unsigned short bad = vec_splats((uint16_t)0);
for (; i + 8 <= codes; i += 8) {
vector unsigned short v = vec_vsx_ld(0, &lens[i]);
bad = vec_or(bad, (vector unsigned short)vec_cmpgt(v, max));
}
if (!vec_all_eq(bad, vec_splats((uint16_t)0))) return -1;
#endif
for (; i < codes; i++)
if (lens[i] > MAX_BITS) return -1;
return 0;
}
static inline int check_lens_swar(const uint16_t *lens, unsigned codes) {
uint64_t bad = 0;
unsigned i = 0;
for (; i + 4 <= codes; i += 4)
bad |= zng_memread_8(&lens[i]);
if (bad & 0xFFF0FFF0FFF0FFF0ULL) return -1;
for (; i < codes; i++)
if (lens[i] > MAX_BITS) return -1;
return 0;
}
static inline int check_lens_scalar(const uint16_t *lens, unsigned codes) {
for (unsigned i = 0; i < codes; i++)
if (lens[i] > MAX_BITS) return -1;
return 0;
}
static void gen_lens(uint16_t *lens, int codes, unsigned seed) {
uint32_t x = seed ? seed : 1;
for (int i = 0; i < codes; i++) {
x ^= x << 13; x ^= x >> 17; x ^= x << 5;
lens[i] = (uint16_t)(x & 0xF);
}
}
#define BENCH_FUNC(impl, codes_val) \
static void BM_check_lens_##impl##_##codes_val(benchmark::State &st) { \
uint16_t lens[320] __attribute__((aligned(16))); \
gen_lens(lens, codes_val, 0xC0DE ^ codes_val); \
for (auto _ : st) { \
benchmark::DoNotOptimize(lens); \
int r = check_lens_##impl(lens, codes_val); \
benchmark::DoNotOptimize(r); \
} \
st.SetItemsProcessed(st.iterations() * codes_val); \
} \
BENCHMARK(BM_check_lens_##impl##_##codes_val);
BENCH_FUNC(simd, 19)
BENCH_FUNC(swar, 19)
BENCH_FUNC(scalar, 19)
BENCH_FUNC(simd, 30)
BENCH_FUNC(swar, 30)
BENCH_FUNC(scalar, 30)
BENCH_FUNC(simd, 286)
BENCH_FUNC(swar, 286)
BENCH_FUNC(scalar, 286)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment