@corporatepiyush
Last active June 20, 2026 02:36
AgentCompact

AGENT.md — Extreme Performance Reference (Compact Edition)

June 2026 · Java 25/26 · Go 1.26 · Rust 1.94 · Python 3.14 · Node 24 · Linux 7.0/6.18 LTS · PG 18.4 · MariaDB 12.3.2 LTS


Latency Numbers (2026 Hardware)

| Operation | Latency |
|---|---|
| L1 hit | ~1 ns |
| L2 hit | ~4 ns |
| L3 hit | ~10–40 ns |
| DDR5 DRAM | ~50–80 ns |
| HBM3e (on-package; latency ~20% HIGHER than DDR5; advantage is bandwidth 1.2 TB/s/stack) | ~100–150 ns |
| CXL 2.0/3.0 over PCIe | ~120–200 ns |
| NVMe PCIe 5 seq read | ~50–100 µs |
| NVMe 4K random read | ~100–200 µs |
| Datacenter TCP RTT | ~500 µs |
| Mutex uncontended | ~20–40 ns |
| Mutex contended | ~200–1000 ns |
| Context switch | ~1–10 µs |
| syscall (ring 0) | ~100–1000 ns |
| vDSO (clock_gettime) | ~5–15 ns |
| CAS uncontended | ~5–10 ns |
| CAS cross-core | ~50–500 ns |
| AVX-512 op 16×f32 (Zen 5) | ~0.5–1 ns |
| GPU kernel launch | ~5–10 µs |
| Branch misprediction | ~10–20 cycles |
| TLB miss 4 KB page | ~100–1000 cycles |
| TLB miss 2 MB page | ~20–100 cycles |

Rule: the L1 → DRAM gap is roughly 50–80× (~1 ns vs ~50–80 ns). Design data layout around this.


Data Types & Memory

  • Smallest correct primitive: use uint8/16/int32 not int64/float64 when domain permits — packs more per cache line.
  • Intern strings at ingestion; replace comparisons with integer IDs in hot paths.
  • No interface{}/any/Object in hot paths — fat pointer + no inlining.
  • Fixed-point (int32 × 1000) for bounded monetary/sensor values; avoids FPU stalls.
  • Pack booleans: uint8 bitfield for 8 flags vs 8 bool fields = 7 bytes saved.
  • Cache line = 64 bytes on x86 and most ARM/RISC-V cores (Apple M-series uses 128 bytes). Struct field order: largest → smallest, which minimizes compiler padding.
  • False sharing: pad independently-written variables to 64 bytes (alignas(64), #[repr(align(64))]).
  • SoA over AoS when a loop touches only a subset of fields — packs relevant data contiguously.
  • AoSoA (blocks of 4–8 = SIMD width) for SIMD-heavy math kernels.
  • SIMD arrays: 32-byte aligned (AVX2), 64-byte aligned (AVX-512). posix_memalign, aligned_alloc, #[repr(align(64))].
  • Zero-copy parsing: store (offset, length) pairs into raw buffer. Only copy when ownership transfers across threads.
  • O_DIRECT / mmap buffers: align to 4096 bytes.
  • Prefetch: __builtin_prefetch(addr, 0, 3) 100–300 ns ahead of pointer-chasing loops.
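
The field-ordering and boolean-packing rules above can be checked from Python with `ctypes`, which lays structs out with the platform's C alignment rules. This is a minimal sketch: the struct names and fields (`flag`, `timestamp`, `kind`, `count`) are hypothetical, and the exact sizes depend on the target ABI.

```python
import ctypes


class Unordered(ctypes.Structure):
    # Fields in "as written" order: small, large, small, medium.
    _fields_ = [
        ("flag", ctypes.c_uint8),
        ("timestamp", ctypes.c_uint64),
        ("kind", ctypes.c_uint8),
        ("count", ctypes.c_uint32),
    ]


class Ordered(ctypes.Structure):
    # Same fields sorted largest to smallest to reduce padding.
    _fields_ = [
        ("timestamp", ctypes.c_uint64),
        ("count", ctypes.c_uint32),
        ("flag", ctypes.c_uint8),
        ("kind", ctypes.c_uint8),
    ]


def pack_flags(flags):
    """Pack up to 8 booleans into one byte (bit i holds flags[i])."""
    byte = 0
    for i, is_set in enumerate(flags[:8]):
        if is_set:
            byte |= 1 << i
    return byte


def flag_is_set(byte, i):
    """Test bit i of a packed flag byte."""
    return (byte >> i) & 1 == 1


unordered_size = ctypes.sizeof(Unordered)
ordered_size = ctypes.sizeof(Ordered)
print(unordered_size, ordered_size)
```

On a typical 64-bit ABI the reordered struct should come out smaller, because the 8-byte field no longer forces padding after each 1-byte field.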

Data Structures — Internals / Use When / Downsides

Hash Maps

| Variant | Internals | Use When | Downside |
|---|---|---|---|
| Swiss Tables (open addr) | Flat array + 1-byte H2 metadata; SIMD _mm_cmpeq_epi8 checks 16 slots | Short keys < 32B; Go 1.24+ built-in; C++ Abseil flat_hash_map | Load factor cap 87.5% (7/8); resize = O(N) copy spike |
| Chaining | Array of linked-list bucket heads | High load factor > 90%; variable-size keys | Pointer chase = cache miss per element; alloc-heavy |
| Perfect hash | Offline-computed MPH; 2–3 ns/key; zero collisions | Static key sets (HTTP headers, opcodes) | Cannot add keys post-build |
| Sharded map | N independent shards; shard = hash(key) & (N-1) | Concurrent writes; reduce lock contention by factor N | Cross-shard ops require N locks; hot key still contends |

Hash functions: integer keys → wyhash/xxHash3; strings < 64B → wyhash; long strings → xxHash3 SIMD; untrusted keys → SipHash-1-3.

Arrays / Collections

  • Dynamic array / Vec: (ptr, len, cap) triple; O(1) append within cap; resize = 2× copy. Pre-allocate: make([]T, 0, cap) / Vec::with_capacity(n) / new ArrayList<>(n).
  • Ring buffer: power-of-two size; head & (size-1) for wrap. Allocation-free FIFO. Use for SPSC I/O/event pipelines.
  • Bag (multiset): {element → count} hash map or sorted array + binary search.
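
The power-of-two wrap described for the ring buffer can be sketched as below. This is a single-threaded illustration of the index arithmetic only; the class name `RingBuffer` is hypothetical, and a real SPSC queue would also need the acquire/release ordering and cache-line separation that Python cannot express.

```python
class RingBuffer:
    """Fixed-capacity FIFO; capacity must be a power of two."""

    def __init__(self, capacity):
        if capacity <= 0 or capacity & (capacity - 1):
            raise ValueError("capacity must be a power of two")
        self._buf = [None] * capacity   # allocated once, never resized
        self._mask = capacity - 1       # index & mask replaces index % capacity
        self._head = 0                  # count of items popped so far
        self._tail = 0                  # count of items pushed so far

    def push(self, item):
        """Append item; return False if the buffer is full."""
        if self._tail - self._head == len(self._buf):
            return False
        self._buf[self._tail & self._mask] = item
        self._tail += 1
        return True

    def pop(self):
        """Remove and return the oldest item, or None if empty."""
        if self._head == self._tail:
            return None
        slot = self._head & self._mask
        item = self._buf[slot]
        self._buf[slot] = None
        self._head += 1
        return item
```

Keeping `head` and `tail` as monotonically increasing counters and masking only on access makes the full and empty states distinguishable without wasting a slot.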

Lock-Free

| Structure | Internals | Use When | Downside |
|---|---|---|---|
| SPSC queue | Ring buffer; head/tail on separate cache lines; acquire/release only | One producer, one consumer; 10–30 ns/op | Strictly single-producer, single-consumer |
| Disruptor | Pre-allocated ring; sequence numbers; CAS claim; wait strategies (busy-spin < yield < park) | MPMC; tens of millions msg/sec; sub-100 ns | Fixed ring size; slow consumer blocks fast producer |
| Sharded | N independent instances; lock per shard | Concurrent maps/caches/rate limiters | Global iteration requires N locks |

Memory ordering cost: x86 (TSO): acquire/release free; ARM/RISC-V: explicit barriers ~10–20 ns. Use weakest correct ordering.

Trees

| Type | Internals | Use When | Downside |
|---|---|---|---|
| B+Tree | Multi-key cache-line nodes; leaves linked | Sorted in-memory maps N > 1000; most DB storage engines | Write amplification on node splits |
| ART | Node4/16/48/256 by child count; SIMD byte-compare; path compression | IP routing, prefix enum, sorted string maps | 8+ bytes/node overhead; complex concurrent impl |
| Van Emde Boas layout | Recursive cache-oblivious layout; each half-height subtree stored in contiguous memory | Static sorted tables queried many times | Read-only after construction; unfamiliar API |

Other Structures

| Structure | Use When | Key Detail |
|---|---|---|
| Bloom filter (blocked) | Pre-filter before expensive op | 10 bits/elem ≈ 1% FPR; blocked variant: 1 cache line per query |
| Cuckoo filter | Bloom + deletion support | 2 bucket reads; load factor > 95% risks insert failure |
| Skip list | Sorted concurrent access; lock-free per-node CAS | O(log N) expected search; extra forward pointers per node and poorer locality than a B-tree |
| CSR graph | BFS/DFS/Dijkstra/PageRank; cache-optimal traversal | offsets[V+1] + edges[E] flat arrays |
| Columnar (Arrow) | Analytics: filter/sum/group touching few columns | All values of column N contiguous |
| Disruptor event bus | In-process pub/sub; millions events/sec | Pre-alloc ring + per-subscriber sequence counter |
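
The "10 bits/elem ≈ 1% FPR" figure for Bloom filters follows from the standard false-positive formula for a classic (non-blocked) filter. A small sketch, with hypothetical function names, that reproduces it:

```python
import math


def optimal_hashes(bits_per_elem):
    """Hash count that minimizes the false-positive rate: (m/n) * ln 2."""
    return max(1, round(bits_per_elem * math.log(2)))


def bloom_fpr(bits_per_elem, num_hashes):
    """Classic Bloom filter false-positive rate: (1 - e^(-k*n/m))^k."""
    return (1.0 - math.exp(-num_hashes / bits_per_elem)) ** num_hashes


k = optimal_hashes(10)
fpr = bloom_fpr(10, k)
print(k, round(fpr * 100, 2))
```

Blocked variants confine all probes to one cache line, which generally costs a somewhat higher false-positive rate than this formula predicts in exchange for the single memory access.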

Control Flow

  • GVN/CSE: hoist loop-invariant loads into const locals before the loop. for (int i = 0; i < obj->len; i++) → const int len = obj->len; const T* d = obj->data; then loop on i < len: one load vs N.
  • Loop unswitching: if a branch predicate is stable across all iterations, manually split into two loops. Each inner loop is clean and vectorizable.
  • Cold-path outlining: __attribute__((noinline, cold)) on all error handlers, assertions, rare branches. Keeps hot basic blocks dense in i-cache.
  • SROA: prefer flat local variables and structs that don't escape a function — compiler promotes them to registers.
  • Inlining: __attribute__((always_inline)) / #[inline(always)] for hot ≤ 20-instruction functions. #[cold] #[inline(never)] for error paths.
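  • Loop unrolling: manual for inner loops < 8 iterations. #pragma GCC unroll N in C/C++. Rust has no stable unroll attribute (#[cfg(target_feature = "avx2")] only gates code on a CPU feature), so unroll by hand or rely on LLVM.
  • Loop tiling: for 2D loops on matrices, tile to L1/L2 cache (tile side ≈ sqrt(cache_size / element_size) when one tile is live; divide cache_size by the number of tiles held at once).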
  • Avoid virtual dispatch in hot paths: interface methods prevent inlining; use concrete types + monomorphization (Rust generics, C++ templates).
  • Devirtualization hint: call-site type hint or PGO feedback causes compiler to inline virtual calls.
  • Auto-vectorization triggers: unit-stride access, no pointer aliasing (restrict/noalias), loop counter known, no function calls inside loop, no breaks/continues.
  • LICM: if value doesn't change per iteration, it must not be inside the loop.
  • SLP vectorization: same operation on independent scalar chains → compiler packs into SIMD. Enable with -O3 -march=native.
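
The loop-tiling rule of thumb can be turned into a small calculator. This is a sketch under stated assumptions: the `live_tiles` divisor (for kernels that keep several tiles resident, such as a matrix multiply holding blocks of A, B and C) and the rounding down to a multiple of 8 are additions for illustration, not part of the rule above.

```python
import math


def tile_side(cache_bytes, elem_bytes, live_tiles=1, multiple=8):
    """Largest square tile side whose live tiles fit in the cache.

    The result is rounded down to a multiple of `multiple` (never below
    it) so rows stay friendly to SIMD widths.
    """
    elems_per_tile = cache_bytes // (elem_bytes * live_tiles)
    side = math.isqrt(elems_per_tile)
    return max(multiple, side - side % multiple)


# 32 KiB L1, f32 elements, one tile live.
print(tile_side(32 * 1024, 4))
# Same cache, three tiles live at once.
print(tile_side(32 * 1024, 4, live_tiles=3))
```

Treat the result as a starting point for benchmarking; associativity conflicts and hardware prefetchers often favor a smaller tile than the arithmetic suggests.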

Memory Management

Allocator Config (1/4 RAM preallocated, tiny+large mix)

jemalloc 5.3.1

MALLOC_CONF="narenas:$(nproc),tcache:true,tcache_max:32768,lg_tcache_nslots_mul:2,\
background_thread:true,dirty_decay_ms:5000,muzzy_decay_ms:10000,\
oversize_threshold:8388608,retain:true,metadata_thp:auto,thp:always"

Pre-warm: mallocx(RAM/4, MALLOCX_ARENA(0)|MALLOCX_TCACHE_NONE) → memset → dallocx. With retain:true, freed extents stay in address space.

tcmalloc / gperftools 2.17

TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=$((512*1024*1024))  # 512 MB
TCMALLOC_RELEASE_RATE=0.1

Abseil TCMalloc programmatic: SetMaxTotalThreadCacheBytes(512<<20), SetBackgroundReleaseRate(0). Call MarkThreadIdle()/MarkThreadBusy() for pool threads.

mimalloc v3.3.2 (recommended default 2026)

MIMALLOC_RESERVE_HUGE_OS_PAGES=$((RAM_MB/2048))  # count of 1 GiB pages; RAM_MB/2048 reserves ~RAM/2 (use RAM_MB/4096 for the 1/4-RAM target)
MIMALLOC_ALLOW_THP=1
MIMALLOC_PAGE_RESET=0        # no page reset on free (faster short-lived alloc reuse)
MIMALLOC_DECOMMIT_DELAY=500  # 500 ms before OS decommit
MIMALLOC_ARENA_EAGER_COMMIT=1

v3 per-request heap: mi_heap_new() → mi_heap_malloc() → mi_heap_destroy() (bulk free).

Choosing: mimalloc v3 → new projects; jemalloc → lowest p99 latency + mixed tiny/large; Abseil TCMalloc → extreme multi-thread throughput; jemalloc/mimalloc → TidesDB/RocksDB (crashes with glibc).

Arena / Object Pools

  • Per-request scratch arena: 64 KB blocks from pool; all allocs from bump pointer; single free at request end.
  • Object pools: pre-create N at startup; lock-free SPSC/MPMC free list; monitor high-water mark.
  • Pool objects: ByteBuffers, connection objects, parser instances, ZSTD_CCtx/ZSTD_DCtx.
  • Off-heap for JVM: MemorySegment (JDK 22+ Panama, stable); Arena.ofConfined() for scoped lifecycle.
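
The object-pool bullets (pre-create N, reuse, watch the high-water mark) can be sketched as follows. The `ObjectPool` class is hypothetical and single-threaded; a production pool would back the free list with a lock-free SPSC/MPMC structure as described above.

```python
class ObjectPool:
    """Pre-created objects with high-water-mark and miss tracking."""

    def __init__(self, factory, size):
        self._factory = factory
        self._free = [factory() for _ in range(size)]  # created at startup
        self._in_use = 0
        self.high_water = 0   # peak number of objects checked out
        self.misses = 0       # acquisitions that had to allocate

    def acquire(self):
        if self._free:
            obj = self._free.pop()
        else:
            # Pool exhausted: fall back to allocation and record it,
            # which signals that the pool is undersized.
            self.misses += 1
            obj = self._factory()
        self._in_use += 1
        self.high_water = max(self.high_water, self._in_use)
        return obj

    def release(self, obj):
        self._in_use -= 1
        self._free.append(obj)
```

A `high_water` value that regularly exceeds the configured size, or a non-zero `misses` counter, suggests raising N.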

GC — JVM (Java 25/26)

G1GC (default, general purpose)

-XX:+UseG1GC -Xms<size> -Xmx<size> -XX:MaxGCPauseMillis=100   # set -Xms equal to -Xmx
-XX:G1HeapRegionSize=16m        # heap/2048; power of 2; 1–32 MB
-XX:G1NewSizePercent=20 -XX:G1MaxNewSizePercent=40
-XX:InitiatingHeapOccupancyPercent=40  # start concurrent mark earlier
-XX:G1ReservePercent=15
-XX:+G1UseAdaptiveIHOP
-XX:+AlwaysPreTouch -XX:+UseNUMA -XX:+UseTransparentHugePages

Java 25: remembered set merge → 2 GB→0.75 GB on 64 GB heap. Java 26 JEP 522: 15% throughput gain from reduced sync overhead.
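
The G1HeapRegionSize comment (heap/2048; power of 2; 1–32 MB) can be worked through as arithmetic. This sketch assumes the quotient is rounded down to a power of two and then clamped; the JVM's own ergonomics may differ in detail, and the function name is hypothetical.

```python
MB = 1024 * 1024


def g1_region_size(heap_bytes, target_regions=2048, lo=1 * MB, hi=32 * MB):
    """Approximate region size: heap/2048, rounded down to a power of
    two (an assumption), clamped to the 1-32 MB range."""
    raw = max(heap_bytes // target_regions, 1)
    pow2 = 1 << (raw.bit_length() - 1)
    return min(max(pow2, lo), hi)


for heap_gb in (1, 8, 32, 48, 256):
    print(heap_gb, g1_region_size(heap_gb * 1024 * MB) // MB)
```

Under these assumptions a 32 GB heap lands on the 16m value used in the flags above.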

ZGC (generational only since Java 24; -XX:+ZGenerational is now a no-op)

-XX:+UseZGC -Xms<size> -Xmx<size>   # set -Xms equal to -Xmx
-XX:SoftMaxHeapSize=<0.75*Xmx>  # normal-load cap; burst to Xmx
-XX:ZAllocationSpikeTolerance=2
-XX:ZUncommitDelay=300
-XX:ConcGCThreads=4
-XX:AOTCache=app.aot             # Java 26+: ZGC + AOT startup compatible
-XX:+AlwaysPreTouch -XX:+UseNUMA -XX:+UseTransparentHugePages

Java 25 Mapped Cache: replaces Page Cache, fixes inflated RSS, reduces fragmentation. ~5–10% throughput overhead vs G1. Needs ~10–15% more total memory.

Decision: G1 for REST APIs / web services (P99 < 100 ms); ZGC for latency-critical (P99 < 10 ms) or caches (16–200 GB); G1 relaxed for batch/ETL; ZGC for HFT + NUMA pin. Shenandoah and Parallel GC: still in JDK but not recommended for new deployments.

JFR always-on: jcmd <pid> JFR.start duration=60s filename=app.jfr settings=profile. Java 25 JEP 509: experimental CPU-time profiling on Linux, enabled through the jdk.CPUTimeSample event.

GC — Node.js / V8 (Node 24 LTS)

# Heap structure: New Space (Nursery + To-Space) | Old Space | Code Space | Large Object
# --max-old-space-size limits ONLY JS heap, not Buffers/native/Code Space → RSS always higher

# High-throughput API (many short-lived objects):
node --max-old-space-size=2048 --max-semi-space-size=128 --max-code-cache-size=256 app.js

# Containerized (512 MB pod):
node --max-old-space-size-percentage=70 app.js   # Node 22+; adapts to any container size

# Data processing (large objects, small churn):
node --max-old-space-size=8192 --max-semi-space-size=32 app.js

Semi-space sizing: max-semi-space-size ≈ (expected_alloc_per_request × 2 × concurrency) / 4. Larger semi-space → fewer Scavenges → less premature promotion. Monitor: v8.getHeapStatistics().used_heap_size / heap_size_limit > 0.85 → OOM risk.
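
The semi-space heuristic and the 0.85 heap threshold are simple arithmetic. A sketch with hypothetical helper names and an assumed workload of 256 KB allocated per request at 512 concurrent requests:

```python
def semi_space_mb(alloc_per_request_kb, concurrency):
    """Heuristic from the text: (alloc_per_request * 2 * concurrency) / 4,
    converted from KB to MB."""
    return (alloc_per_request_kb * 2 * concurrency) / 4 / 1024


def oom_risk(used_heap_size, heap_size_limit, threshold=0.85):
    """True when heap usage crosses the monitoring threshold."""
    return used_heap_size / heap_size_limit > threshold


print(semi_space_mb(256, 512))   # candidate --max-semi-space-size in MB
print(oom_risk(1800, 2048))
```

The two `oom_risk` arguments correspond to the `used_heap_size` and `heap_size_limit` fields returned by `v8.getHeapStatistics()`.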


I/O

io_uring (Linux 5.1+; liburing 2.9; PG18 io_method=io_uring)

⚠ Blocked by default in Docker ≥ 25 / containerd ≥ 2.0 seccomp + Google fleet. Ship an epoll fallback; in containers allowlist io_uring_setup/enter/register in a custom seccomp profile (never seccomp=unconfined).

  • Non-circular queue (Linux 7.0): better cache perf for IOPOLL; fixes mixed-device completion deferral.
  • cBPF filters (Linux 7.0): per-ring operation allow/deny in containers.
  • IORING_SETUP_SQPOLL: kernel polling thread → zero syscalls at sustained I/O.
  • Multishot accept/recv: IORING_OP_ACCEPT with IORING_ACCEPT_MULTISHOT and IORING_OP_RECV with IORING_RECV_MULTISHOT; one SQE handles all connections/data arrivals.
  • IORING_OP_SEND_ZC (6.0+): zero-copy send; app notified when buffer safe to reuse.
  • io_uring_register_buf_ring() (5.19+): pre-registered buffers; kernel selects free buffer per recv.
  • FUSE over io_uring (6.14+, carried on IORING_OP_URING_CMD): 20–40% FUSE latency reduction.
  • Ring resize (6.13+): io_uring_resize_rings() without tearing down and recreating the ring.

Libraries: liburing (C), tokio-uring 0.5/monoio 0.2 (Rust), Netty IOUring (Java), iouring-go (Go).

Buffer Management

  • Read buffer size: ≥ 1 full application frame (HTTP/2: 16 KB; binary: max PDU size).
  • Write buffering: TCP_CORK / MSG_MORE to hold until batch full. Then flush in one sendmsg.
  • Page-cache bypass: O_DIRECT for database-type I/O (aligned 4096, size multiple of block size).
  • mmap for large read-only datasets: OS page cache manages eviction; no read() syscalls.
  • MADV_WILLNEED: prefault pages before hot path. MADV_DONTNEED to release without freeing VA.

UDP GSO/GRO (Linux 4.18+ GSO; 5.0+ GRO; 6.2+ virtio TUN)

  • GSO: batch N datagrams into one super-packet; UDP_SEGMENT cmsg with per-segment size. One sendmsg() → N wire datagrams. Requires tx-udp-segmentation + tx-checksum-ip-generic in ethtool.
  • GRO: int on = 1; setsockopt(fd, SOL_UDP, UDP_GRO, &on, sizeof(on)); recvmsg returns super-packet + UDP_GRO cmsg with stride.
  • Fallback: sendmmsg/recvmmsg for environments without hardware GSO — still batches syscalls.
  • Result: Tailscale 10 Gb/s sustained (was 1–3 Gb/s pre-GSO/GRO). QUIC/HTTP3 uses same path.

Concurrency

  • SPSC: ring buffer, power-of-two size, head/tail on separate 64-byte cache lines, acquire/release only. 10–30 ns/op.
  • Disruptor: pre-allocated ring, sequence numbers, CAS claim, busy-spin wait. Sub-100 ns; tens of millions msg/s.
  • Sharding: N independent structures, shard = hash(key) & (N-1). Reduce contention by N. N = CPU cores × 4, rounded up to a power of two so the mask works.
  • Atomic ordering (cheapest → costliest): relaxed → acquire/release → seq_cst. x86: acquire/release free (TSO); ARM: barriers ~10–20 ns.
  • False sharing: pad independently-written atomics/counters to 64 bytes.
  • Thread pool: size = CPU count for CPU-bound; higher for I/O-bound (benchmark). Avoid ThreadPerTask.
  • Work stealing: ForkJoinPool (Java), Tokio scheduler (Rust), Go runtime (M:N). Minimizes thread idle time.
  • rseq (restartable sequences, glibc 2.35+): per-CPU counters/freelists at plain load/store speed — no atomics, ever. Kernel restarts the section on preemption/migration. Abseil TCMalloc per-CPU mode is built on it.
  • NUMA: allocate memory on same NUMA node as threads that use it. numactl --membind=N --cpunodebind=N. GOMAXPROCS reads cgroup CPU quota in Go 1.25+.
  • Lock-free invariant: load(acquire) pairs with store(release). CAS loops: exponential backoff or yield before retry.
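
The sharding bullet can be sketched as a lock-per-shard counter map. The names `ShardedCounter` and `next_pow2` are hypothetical, and rounding the shard count up to a power of two is an assumption made so that `hash(key) & (N-1)` covers every shard.

```python
import threading


def next_pow2(n):
    """Smallest power of two >= n."""
    return 1 if n <= 1 else 1 << (n - 1).bit_length()


class ShardedCounter:
    """Counter map split into N independently locked shards."""

    def __init__(self, cpu_cores):
        self.n = next_pow2(cpu_cores * 4)
        self._mask = self.n - 1
        self._locks = [threading.Lock() for _ in range(self.n)]
        self._maps = [dict() for _ in range(self.n)]

    def _shard(self, key):
        # Python's & on a negative hash still yields a value in [0, N).
        return hash(key) & self._mask

    def incr(self, key, delta=1):
        i = self._shard(key)
        with self._locks[i]:
            self._maps[i][key] = self._maps[i].get(key, 0) + delta

    def get(self, key):
        i = self._shard(key)
        with self._locks[i]:
            return self._maps[i].get(key, 0)

    def total(self):
        # Global operation: visits every shard, taking each lock in turn.
        result = 0
        for lock, shard in zip(self._locks, self._maps):
            with lock:
                result += sum(shard.values())
        return result
```

`total()` shows the cost named in the tables above: any global operation has to touch all N locks, and a single hot key still serializes on its one shard.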

Networking & IPC

TCP/Socket Options

TCP_NODELAY=1        // disable Nagle; critical for request-response
SO_KEEPALIVE=1       // detect dead peers
SO_REUSEADDR=1       // fast restart
SO_REUSEPORT=1       // parallel accept across threads
SO_RCVBUF=4*1024*1024
SO_SNDBUF=4*1024*1024
SO_BUSY_POLL=50      // microseconds; avoids sleep/wakeup; uses CPU

Linux TCP: tcp_slow_start_after_idle=0, tcp_congestion_control=bbr, net.core.netdev_max_backlog=65536, net.ipv4.ip_local_port_range="1024 65535".

TLS 1.3

  • 1-RTT PSK resumption: server sends NewSessionTicket after handshake; client re-sends PSK in next ClientHello. Go: tls.Config.SetSessionTicketKeys() for distributed KEK sharing. Nginx: ssl_session_cache shared:SSL:50m; ssl_session_timeout 1d.
  • 0-RTT early data: client sends data with ClientHello — zero RTT. Use ONLY for idempotent ops (GET, reads). Anti-replay: Redis SET NX per ticket + ±5s time window.
  • Distributed KEK: rotate every 24h, accept old for 48h. Store in Vault/Secrets Manager. Ticket lifetime: 24h default, 1h for high-security APIs.
  • OCSP stapling: eliminates 100–500 ms CA round-trip on new sessions. Nginx: ssl_stapling on; ssl_stapling_verify on;.
  • 40% of Cloudflare TLS sessions are resumptions.
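
The 0-RTT anti-replay rule (one acceptance per ticket plus a ±5 s time window) can be sketched with an in-memory dictionary standing in for Redis `SET NX`. The `ReplayGuard` class, its method names and the 2 × window retention are hypothetical choices for illustration; a real deployment needs a store shared by every TLS terminator.

```python
class ReplayGuard:
    """Accept each early-data ticket at most once inside a time window."""

    def __init__(self, window_s=5.0):
        self.window_s = window_s
        self._seen = {}  # ticket_id -> time after which the entry may expire

    def accept(self, ticket_id, client_time, now):
        # Reject anything outside the allowed clock-skew window.
        if abs(now - client_time) > self.window_s:
            return False
        # Drop entries that can no longer pass the window check.
        for ticket in [t for t, exp in self._seen.items() if exp < now]:
            del self._seen[ticket]
        # Equivalent of SET NX: only the first writer wins.
        if ticket_id in self._seen:
            return False
        # Keep the entry for the full span in which a replay could
        # still pass the skew check (2 * window).
        self._seen[ticket_id] = now + 2 * self.window_s
        return True
```

Passing `now` explicitly keeps the sketch deterministic; a server would supply its own monotonic or wall-clock time.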

Zstd Custom Dictionary

zstd --train samples/* -o dict.zstd --maxdict=114688   # 112 KB dict

Hot path: ZSTD_createCDict_byReference() once; ZSTD_compress_usingCDict(cctx, ...) per call. No alloc. Similarly ZSTD_DDict + ZSTD_decompress_usingDDict. Zstd 1.5.7 (Feb 2025): 4.9× faster dict ops. Gains: 30–40× on API responses; 60–90× on similar-structure payloads (Roblox feature flags).


Logging, Metrics & Observability

  • Async thread-local logging: thread-local ring buffer → MPSC queue → single background writer. No sprintf in hot path; pre-format to fixed struct; format in background.
  • Level check: single atomic load before any formatting. DEBUG = no-op in production.
  • Rate limiting: emit first occurrence; suppress + summary every 1 s.
  • Metrics: per-CPU-core counters (no synchronization); aggregate at scrape time. HDR Histogram for latency; Prometheus format.
  • Tracing: W3C TraceContext (traceparent); 1% head sampling or tail-based (bias slow/error). OpenTelemetry → Jaeger/Tempo. Span overhead: skip per-iteration spans.
  • Continuous profiling: eBPF (Parca, Pyroscope); zero overhead when idle.
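
The log rate-limiting rule (emit the first occurrence, then suppress and summarize once per interval) can be sketched as below. `RateLimitedLog` and its `emit` callback are hypothetical; the clock is passed in so the behavior is deterministic.

```python
class RateLimitedLog:
    """Emit the first message per key, then at most one per interval,
    preceded by a count of what was suppressed."""

    def __init__(self, emit, interval_s=1.0):
        self._emit = emit
        self._interval = interval_s
        self._state = {}  # key -> [window_start, suppressed_count]

    def log(self, key, message, now):
        state = self._state.get(key)
        if state is None:
            # First occurrence: always emitted.
            self._state[key] = [now, 0]
            self._emit(message)
            return
        if now - state[0] >= self._interval:
            # Window elapsed: summarize suppressed messages, then emit.
            if state[1]:
                self._emit(f"{key}: suppressed {state[1]} similar messages")
            state[0], state[1] = now, 0
            self._emit(message)
        else:
            # Inside the window: count only, no formatting or I/O.
            state[1] += 1
```

The suppressed path does nothing but an integer increment, which keeps it compatible with the "no formatting in the hot path" rule; the summary string is built only once per interval.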

Architecture Principles

  • Monolith over microservices: in-process call < 10 ns vs network call 500 µs–5 ms. Shared memory 1000× faster than REST for same-host data.
  • No runtime reflection/DI in hot path: Spring/Guice startup cost, proxy overhead, megamorphic call sites. Use compile-time codegen (Quarkus, Micronaut, Dagger) or manual wiring.
  • Avoid dynamic dispatch in hot loops: virtual calls prevent inlining; cause megamorphic sites (> 3 types = 5–20× slower). Use concrete types + generics/templates (Rust impl Trait, C++ templates).
  • Codegen at build time: Protobuf/FlatBuffers, JOOQ/SQLc, ANTLR. Generated code is inlineable, statically typed, optimizable.
  • No DI containers in request path: context object pattern — pass RequestContext struct through call chain.

FFI & Native Interop

General Rules

  • Batch work across FFI boundary: one call with large array vs N calls × 1 element.
  • CGO cost: ~60–200 ns per call (goroutine stack switch). Batch.
  • Pass pre-allocated output buffers — avoid native code allocating and returning heap ptrs.
  • Off-heap exchange medium: no GC pinning needed.

Polars Shim (Universal SIMD Backend)

Pattern: Rust cdylib shim (polars crate) → extern "C" opaque handles (u64) → any language via its FFI. Arrow C Data Interface (ArrowSchema/ArrowArray structs) for zero-copy data transfer.

Key: polars_df_execute_plan(json_plan) — ship entire lazy query plan as JSON string; single FFI call. Polars does predicate pushdown + parallel execution.

Callers: Bun.js via bun:ffi (dlopen, FFIType.u64, ptr(TypedArray) = zero copy); Go via CGO.

ArrayFire (Universal GPU/SIMD Backend)

Access methods:

  1. Official language binding (pip install arrayfire, npm install arrayfire-js, cargo add arrayfire).
  2. Direct C ABI (libaf handles) via any FFI.
  3. Custom Rust cdylib shim for domain-fused ops (preserves JIT kernel fusion across FFI boundary).

Critical: chain ops inside shim to preserve JIT fusion. af_shim_matmul_sigmoid() = one GPU kernel; calling matmul then sigmoid separately = two kernels + VRAM roundtrip.

Backend order: CUDA → OpenCL → CPU (BLAS+AVX-512). set_backend(Backend::DEFAULT).


Hardware & SIMD

CPU Microarchitecture (2026)

| Property | Key | Code implication |
|---|---|---|
| OoOE (ROB 200–600) | Decouples issue from execute | Expose independent ops; use multiple accumulators |
| Superscalar (4–8 ports/cycle) | Mix port types in inner loops | Interleave load/store/ALU to saturate all ports |
| Branch predictor (TAGE) | > 99% accuracy regular patterns | cmov/branchless for per-element data; arrange loop exits as "not taken" |
| Store-forwarding | Same-address load after store: ~5 cycles | Fails (10–15 cycle stall) on size/alignment mismatch |
| MLP (10–20 outstanding MSHR) | Sequential access saturates; pointer-chase serializes | Prefer flat arrays; issue __builtin_prefetch 200–400 ns ahead |
| I-cache / µop cache (1500–4000 µops) | Hot loops must fit | #[cold] / __attribute__((cold)) for error paths |

2026 silicon: Zen 5 = full 512-bit AVX-512 datapath (4 native units; 40–50% uplift vs Zen 4). Intel Panther Lake = Intel 18A (GAA transistors + BSPDN), mobile-first, Q1 2026. AMD Zen 6 (EPYC Venice H2 2026) = TSMC 2nm N2P, AVX10.2, FP8.

SIMD ISA (2026)

| ISA | Width | CPUs | Key Ops |
|---|---|---|---|
| AVX2 | 256-bit | All modern x86 | 8×f32/int; baseline SIMD target |
| AVX-512 | 512-bit | Zen 4/5, Xeon 6 | 16×f32; mask registers; compress/expand |
| AVX10.1 | 512-bit | Xeon 6 (Granite Rapids) | Unified ISA: all AVX-512 sub-ext in one CPUID bit |
| AVX10.2 | 512-bit | Xeon Diamond Rapids 2026, Zen 6 | FP16/BF16 scalar+vector; OCP FP8 (E4M3/E5M2); IEEE NaN semantics |
| ARM SVE2 | 128–2048-bit (runtime) | Cortex-X4, Graviton 4 | Variable-width; one binary for all widths |
| ARM SME | | Cortex-X925, Apple M4 | Matrix outer product; standardized AMX equivalent |
| Intel AMX | 8 tiles (16×64 B each) | Xeon 4/5/6 only | 1024 MAC/instruction; ~512 INT8 TOPS/socket |

Google Highway 1.2 (github.com/google/highway): one source → AVX2/AVX-512/AVX10/NEON/SVE2/WASM. HWY_DYNAMIC_DISPATCH. vqsort = 3–8× vs std::sort. Preferred over raw intrinsics.

AVX10.2 detection: CPUID leaf 24H, EAX=24H, ECX=0H → EBX[7:0] >= 2.


LLM Inference Serving

Decoding is memory-bandwidth-bound: tokens/sec ≈ mem_BW / bytes_per_token. 70B FP16 (140 GB) on 3.35 TB/s ≈ 24 tok/s single-stream ceiling; 4-bit (35 GB) ≈ 96 tok/s. This formula explains most inference behavior.
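
The bandwidth-bound ceiling can be reproduced directly. A minimal sketch, assuming every decoded token reads all weights once and ignoring KV-cache traffic and batching; the function name is hypothetical.

```python
def decode_tokens_per_sec(params_billion, bytes_per_param, mem_bw_tb_s):
    """Single-stream ceiling: memory bandwidth / bytes read per token."""
    model_gb = params_billion * bytes_per_param   # weights touched per token
    return mem_bw_tb_s * 1000.0 / model_gb


fp16 = decode_tokens_per_sec(70, 2.0, 3.35)   # 70B at 2 bytes/param
int4 = decode_tokens_per_sec(70, 0.5, 3.35)   # 70B at 4 bits/param
print(round(fp16), round(int4))
```

Batching changes the picture because the same weight read is amortized over every sequence in the batch, which is why the techniques in the table below focus on keeping batches full.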

| Technique | Gain | Tool |
|---|---|---|
| Continuous batching (evict/admit per decode step) | 5–20× vs static batching | vLLM, TGI, TensorRT-LLM |
| PagedAttention (KV cache in OS-style pages) | Recovers 60–80% fragmented KV memory | vLLM |
| Prefix caching / RadixAttention (shared system prompts) | Large TTFT cut at high QPS | SGLang |
| Speculative decoding (draft k tokens, verify in 1 pass) | 2–3× decode, identical output dist | EAGLE, Medusa |
| KV cache FP8/INT8 | 2× max batch size | Hopper+/AVX10.2 FP8 |
| Weight 4-bit (AWQ/GPTQ/FP4) | ~4× decode throughput, ≤1% quality loss | AWQ |
| Chunked prefill | Long prompt no longer stalls decode latency | vLLM/SGLang |
| CUDA Graphs decode capture | Removes ~5–10 µs × N kernel-launch overhead/token | TensorRT-LLM |

TP (tensor parallel) within NVLink node; PP (pipeline) across nodes. Stacks: vLLM (default), SGLang (shared-prefix QPS), TensorRT-LLM (peak NVIDIA), llama.cpp (CPU/edge).

Protocol Parsing (Zero-Allocation State Machine)

  • Parser state = integer cursor variables in CPU registers only. No heap alloc in hot path.
  • Store (start: uint16, len: uint16) offsets into raw receive buffer — not copies.
  • ErrIncomplete: return without consuming bytes; caller accumulates in ring buffer + retries.
  • Bulk delimiter scan: bytes.IndexByte (Go) / memchr (C/Rust) — SIMD, skips 16–32 bytes/cycle vs 1.
  • No fmt.Sprintf, regex.Match, json.Unmarshal inside state machine. Identify boundaries first; validate semantics in app layer.
  • Pre-allocate output struct from pool; fill in-place; single-pass left-to-right; no backtracking.
  • Tiered codec pipeline: receive buffer → AES-NI decrypt → LZ4/Zstd decompress → state machine parser → frame struct. All buffers pre-allocated in graduated pools (4KB/16KB/64KB/256KB).
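
The offset-based parsing rules can be sketched for a toy line protocol of the form `KEY:VALUE` terminated by a newline. The protocol, the `Incomplete` exception and the function names are hypothetical; `bytes.find` plays the role of the memchr-style bulk delimiter scan.

```python
class Incomplete(Exception):
    """Raised when the buffer ends before a full frame; nothing is consumed."""


def parse_line(buf, pos=0):
    """Parse one KEY:VALUE line starting at pos.

    Returns ((key_start, key_len), (val_start, val_len), next_pos).
    Only offsets into buf are returned; no bytes are copied.
    """
    end = buf.find(b"\n", pos)           # bulk delimiter scan
    if end < 0:
        raise Incomplete
    colon = buf.find(b":", pos, end)
    if colon < 0:
        raise ValueError("missing ':' separator")
    key = (pos, colon - pos)
    value = (colon + 1, end - colon - 1)
    return key, value, end + 1


def view(buf, span):
    """Zero-copy window over a (start, length) span."""
    start, length = span
    return memoryview(buf)[start:start + length]


buf = b"host:example\nlen:42\npart"
key, value, nxt = parse_line(buf)
print(key, value, nxt, bytes(view(buf, key)))
```

On `Incomplete` the caller keeps the unconsumed tail (here `b"part"`), appends the next read, and retries from the same position.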

Database

PostgreSQL 18.4 (current) / PG19 Beta 1 (June 4, 2026)

PG18 AIO (headline feature):

io_method = worker            # default; all platforms
# io_method = io_uring        # Linux 5.1+; requires --with-liburing; fastest
io_workers = 8                # ~25% of CPU cores
effective_io_concurrency = 200   # NVMe; 20–50 SATA SSD; 1–4 HDD
maintenance_io_concurrency = 64
io_combine_limit = 128         # pages per AIO request; larger = higher throughput

Verify liburing: SELECT setting FROM pg_config() WHERE name='CONFIGURE' → grep liburing.

Other PG18: skip scan B-tree (multi-column index without leading eq predicate), parallel GIN builds, statistics preserved across pg_upgrade, uuidv7() (monotonic UUIDs, fewer B-tree splits), virtual generated columns, wire protocol v3.2.

PG19 Beta: SQL/PGQ native graph queries, ON CONFLICT DO SELECT (atomic get-or-create), parallel autovacuum, pg_plan_advice hints framework, online REPACK, 64-bit MultiXact, JIT off by default, WAIT FOR LSN on replicas.

Production postgresql.conf:

shared_buffers = 8GB               # 25% RAM
work_mem = 64MB                    # per sort/hash op (careful: × max_connections × 4)
maintenance_work_mem = 1GB         # VACUUM, CREATE INDEX
wal_compression = zstd             # 40–60% WAL reduction
max_wal_size = 8GB
checkpoint_completion_target = 0.9
synchronous_commit = off           # async commit; risk: lose up to ~3 × wal_writer_delay (≈600 ms at the 200 ms default) of commits on crash
huge_pages = try
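
The work_mem warning (× max_connections × 4) is worth computing before committing to a value. A sketch, assuming 200 connections (a hypothetical figure not given above) and the factor of 4 sort/hash operations per query from the comment:

```python
def worst_case_work_mem_gb(work_mem_mb, max_connections, ops_per_query=4):
    """Upper bound if every connection runs ops_per_query sort/hash
    nodes at full work_mem simultaneously."""
    return work_mem_mb * max_connections * ops_per_query / 1024


print(worst_case_work_mem_gb(64, 200))
```

With shared_buffers = 8GB sized at 25% of RAM, the host has about 32 GB, so this worst case would exceed physical memory; a pooler that caps active connections is one way to bound it.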

PostgreSQL Indexes

| Type | Use When | Downside |
|---|---|---|
| B-Tree | Most scalar columns; <, =, LIKE 'x%', BETWEEN | Write amplification on splits |
| Partial | WHERE active=true; low-cardinality flag; 5–50× smaller | Query WHERE must syntactically imply index predicate (form must match) |
| Expression | lower(email), date_trunc(...) | Applies only to matching expression |
| GIN | Full-text, JSONB @>, arrays &&; tsvector | Expensive updates; enable fastupdate=on; VACUUM pressure |
| GIN parallel | PG18: CREATE INDEX USING gin(...) uses parallel workers | |

Partition pruning works only when: partition key in WHERE, constant or bound param, enable_partition_pruning=on. Function call on partition key (date_trunc(...)) prevents pruning. Use pg_partman for auto-creation.

Partition strategies: RANGE (time-series), LIST (region/enum), HASH (uniform distribution). Each partition gets its own index — smaller and faster to maintain.

TimescaleDB

CREATE EXTENSION IF NOT EXISTS timescaledb;
SELECT create_hypertable('metrics', 'time', chunk_time_interval => INTERVAL '1 day');
-- Compression: 90–98% space saving on cold chunks
ALTER TABLE metrics SET (timescaledb.compress, timescaledb.compress_segmentby='device_id',
  timescaledb.compress_orderby='time DESC');
SELECT add_compression_policy('metrics', INTERVAL '7 days');
-- Continuous aggregate with auto-refresh
CREATE MATERIALIZED VIEW metrics_1h WITH (timescaledb.continuous) AS
  SELECT time_bucket('1h', time) AS bucket, device_id, avg(cpu) FROM metrics GROUP BY 1,2;
SELECT add_continuous_aggregate_policy('metrics_1h',
  start_offset => INTERVAL '3 hours', end_offset => INTERVAL '1 hour',
  schedule_interval => INTERVAL '1 hour');
SELECT add_retention_policy('metrics', INTERVAL '90 days');

pgvector

CREATE EXTENSION vector;
-- HNSW (preferred): graph-based ANN, ~95% recall
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops)
  WITH (m=16, ef_construction=64);
SET hnsw.ef_search = 200;     -- nodes explored per query; higher = better recall
-- IVFFlat alternative: smaller index, ~90% recall
CREATE INDEX ON documents USING ivfflat (embedding vector_cosine_ops) WITH (lists=100);  -- opclass must match the query operator (<=> below)
SET ivfflat.probes = 10;
-- Nearest-neighbor query
SELECT id FROM documents ORDER BY embedding <=> $1 LIMIT 10;
-- <=> cosine distance; <-> L2; <#> inner product

After failover: SELECT pg_prewarm('idx_embedding'). Per-partition HNSW for multi-tenant. work_mem=256MB, max_parallel_workers_per_gather=4 for heavy ANN loads.
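
The three pgvector operators compute different quantities, which matters when picking the index operator class. A plain-Python sketch of what each returns (function names are hypothetical; note that `<#>` yields the negative inner product so that ascending ORDER BY ranks the best match first):

```python
import math


def l2_distance(a, b):
    """Equivalent of <-> : Euclidean distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """Equivalent of <=> : 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def neg_inner_product(a, b):
    """Equivalent of <#> : negative inner product."""
    return -sum(x * y for x, y in zip(a, b))


print(l2_distance([0, 0], [3, 4]))
print(cosine_distance([1, 0], [0, 1]))
print(neg_inner_product([1, 2], [3, 4]))
```

Cosine distance ignores vector length, so `[1, 2]` and `[2, 4]` are at distance ~0 under `<=>` but not under `<->`.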

BM25 Full-Text Search

pg_textsearch (Timescale, v1.0 March 2026, PostgreSQL-native pages, Block-Max WAND): CREATE INDEX USING bm25ts (content). Query: ORDER BY content <@> 'query' DESC. 2.4–6.5× faster than ParadeDB for 2–4 term queries at 138M docs; 8.7× higher concurrent throughput.

pg_search (ParadeDB v0.22.5, Rust/Tantivy, AGPL): CREATE INDEX USING bm25(id, desc, category) WITH (key_field='id'). Query: WHERE tbl @@@ paradedb.parse('desc:fast'). Supports faceted search, fuzzy (~2), highlight, hybrid BM25+pgvector.

Choose pg_textsearch for pure throughput/concurrency; pg_search for full Elasticsearch-style features.

Apache AGE (Graph Extension, PG 11–18)

CREATE EXTENSION IF NOT EXISTS age;
LOAD 'age'; SET search_path = ag_catalog, "$user", public;
SELECT create_graph('g');
SELECT * FROM cypher('g', $$ CREATE (:Person {name:'Alice'})-[:KNOWS]->(:Person {name:'Bob'}) $$) AS (v agtype);
SELECT * FROM cypher('g', $$ MATCH (a:Person)-[:KNOWS*1..3]->(b) RETURN b.name $$) AS (name agtype);

Hybrid SQL+Cypher: join cypher result with relational table. GIN index on properties column for agtype filter performance.

PostgreSQL 18 Primary–Replica Replication (Complete Setup)

# 1) PRIMARY postgresql.conf
wal_level = replica
max_wal_senders = 10
wal_keep_size = 1024                 # MB
wal_compression = zstd
synchronous_commit = on
hot_standby = on
max_slot_wal_keep_size = 4096        # cap WAL retention if replica goes offline (MB)
io_method = worker                   # PG18 AIO (read I/O)
io_workers = 8

# 2) PRIMARY: replication user + pg_hba.conf
psql -c "CREATE ROLE replicator WITH REPLICATION LOGIN ENCRYPTED PASSWORD 'pass';"
echo "host replication replicator 192.168.1.11/32 scram-sha-256" >> pg_hba.conf
psql -c "SELECT pg_create_physical_replication_slot('replica1_slot');"
systemctl reload postgresql@18-main

# 3) REPLICA: clone (writes standby.signal + primary_conninfo automatically)
systemctl stop postgresql@18-main && rm -rf /var/lib/postgresql/18/main
pg_basebackup --host=192.168.1.10 --username=replicator \
  --pgdata=/var/lib/postgresql/18/main --slot=replica1_slot \
  --wal-method=stream --write-recovery-conf --checkpoint=fast --progress

# 4) REPLICA postgresql.conf additions
hot_standby = on
hot_standby_feedback = on            # feedback stops primary VACUUM removing rows replica needs
max_standby_streaming_delay = 30s
io_method = worker
io_workers = 4
systemctl start postgresql@18-main
-- 5) Verify — on PRIMARY:
SELECT client_addr, state, sync_state,
       pg_size_pretty(pg_wal_lsn_diff(sent_lsn, replay_lsn)) AS lag
FROM pg_stat_replication;        -- expect state=streaming, lag=0 bytes
-- on REPLICA:
SELECT pg_is_in_recovery();      -- true
SELECT now() - pg_last_xact_replay_timestamp() AS delay;
-- Slot lag (growing = lagging replica blocks WAL cleanup):
SELECT slot_name, active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained
FROM pg_replication_slots;

Sync (zero data loss): synchronous_commit=remote_apply; synchronous_standby_names='FIRST 1 (replica1)', where replica1 must match application_name in the replica's primary_conninfo. This adds 1 RTT per commit; use remote_write for lower latency. Logical (cross-version, per-table): CREATE PUBLICATION p FOR TABLE t1,t2; on the publisher, then CREATE SUBSCRIPTION s CONNECTION '...' PUBLICATION p WITH (streaming=on); on the subscriber. Failover: SELECT pg_promote(); on the replica, then repoint other replicas' primary_conninfo. Use Patroni for automation; PgBouncer in front of each tier.

MariaDB 12.3 LTS (May 29, 2026; supported June 2029)

InnoDB-Based Binary Log (headline feature):

log-bin
binlog-storage-engine = innodb   # stores binlog in InnoDB tablespace
                                  # eliminates 2PC; halves fsyncs; 4× write throughput
                                  # 2.4× faster single-thread; 50% commit latency reduction
                                  # crash-safe without sync overhead

# Safe with InnoDB binlog (was dangerous with file-based binlog):
innodb_flush_log_at_trx_commit = 2
sync_binlog = 0

Full my.cnf key params:

server_id = 1
binlog-storage-engine = innodb
gtid_domain_id = 1
gtid_strict_mode = ON
binlog_format = ROW
binlog_row_image = MINIMAL              # 50–80% smaller binlog
innodb_buffer_pool_size = 12G           # 70–80% RAM
# innodb_buffer_pool_instances was removed in MariaDB 10.6; do not set it
innodb_flush_method = O_DIRECT
innodb_io_capacity = 2000               # NVMe: 8000–20000
innodb_log_file_size = 4G               # redo log size (innodb_redo_log_capacity is a MySQL 8.0.30+ variable)
thread_handling = pool-of-threads
thread_pool_size = 32
query_cache_type = 0
query_cache_size = 0                    # query cache disabled

Other 12.3 features: native vector search in storage, JOIN_FIXED_ORDER, MAX_EXECUTION_TIME optimizer hints, caching_sha2_password (MySQL 8 compat), XML type, aria_pagecache_segments (1–128 for parallel Aria).

MariaDB 12.3 GTID Replication (Complete Setup)

# 1) PRIMARY /etc/mysql/mariadb.conf.d/50-server.cnf
[mariadb]
server_id=1
log-bin
log_basename=mariadb-primary
binlog-storage-engine=innodb     # 12.3: crash-safe binlog, no 2PC, 4× write throughput
gtid_domain_id=1
gtid_strict_mode=ON
binlog_format=ROW
binlog_row_image=MINIMAL          # 50–80% smaller binlog
sync_binlog=0                     # safe WITH InnoDB binlog (redo log covers it)
innodb_flush_log_at_trx_commit=2  # safe WITH InnoDB binlog
expire_logs_days=7

# 2) REPLICA config
[mariadb]
server_id=2                       # must be unique
log-bin
log_basename=mariadb-replica1
binlog-storage-engine=innodb
gtid_domain_id=1
gtid_strict_mode=ON
log_slave_updates=ON              # needed for chained replicas
read_only=ON                      # MariaDB has no super_read_only; accounts holding READ_ONLY ADMIN can still write
relay_log=/var/log/mysql/relay-bin
relay_log_purge=ON
-- 3) PRIMARY: replication user
CREATE USER 'replicator'@'192.168.1.%' IDENTIFIED BY 'pass';
GRANT REPLICATION SLAVE ON *.* TO 'replicator'@'192.168.1.%';
# 4) Backup primary (hot, no locks) → restore on replica
mariabackup --backup --target-dir=/backup/base --user=root --password=...
mariabackup --prepare --target-dir=/backup/base
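cat /backup/base/mariadb_backup_binlog_info   # binlog file, position, GTID, e.g. 1-1-14728 (file is named xtrabackup_binlog_info on older releases)
# on replica: stop mariadb; rm -rf /var/lib/mysql/*; mariabackup --copy-back ...; chown -R mysql:mysql
-- 5) REPLICA: point at primary using GTID and start
SET GLOBAL gtid_slave_pos = '1-1-14728';   -- domain-server-sequence; domain 1 matches gtid_domain_id=1 above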
CHANGE MASTER TO MASTER_HOST='192.168.1.10', MASTER_USER='replicator',
  MASTER_PASSWORD='pass', MASTER_USE_GTID=slave_pos;
START SLAVE;
SHOW SLAVE STATUS\G   -- Slave_IO_Running=Yes, Slave_SQL_Running=Yes, Seconds_Behind_Master=0

-- 6) Failover: on promoted replica
STOP SLAVE; RESET SLAVE ALL;
SET GLOBAL read_only=OFF;
-- repoint others: STOP SLAVE; CHANGE MASTER TO MASTER_HOST='new', MASTER_USE_GTID=slave_pos; START SLAVE;

MaxScale for automation: module=mariadbmon monitor with auto_failover=true, auto_rejoin=true; router=readwritesplit service sends writes→primary, reads→replicas; apps connect to MaxScale listener (e.g. :4006) instead of 3306.

TidesDB v9.0.8 / TideSQL v4.2.4 (in MariaDB)

INSTALL SONAME 'ha_tidesdb';
CREATE TABLE events (...) ENGINE=TidesDB;

TidesDB vs RocksDB (NVMe, HammerDB TPC-C, Feb 2026): p50 GET 3 µs vs 4 µs; p99 GET 7 µs vs 12 µs; iteration 1.42× faster; storage 5.6× smaller. Write-heavy TPC-C: TidesDB wins; read-dominant: InnoDB wins. Stable with jemalloc; RocksDB crashes. v8.6: max_memory_usage field in tidesdb_config_t caps total engine footprint.

KV databases: TidesDB v9.0.8 (C, ACID+SSI, 5 isolation levels, column families, Kafka Streams drop-in); BadgerDB 4.7.0 (pure Go, WiscKey LSM, SSI, Dgraph/Jaeger/Pyroscope production).

Data Formats

| Format | Use When |
|---|---|
| Parquet | Analytical queries; 5–20× smaller than JSON; row group 128 MB, ZSTD L3 |
| Iceberg | Parquet + ACID + time travel + partition evolution + O(partitions) pruning |
| DuckLake | Iceberg semantics; catalog = DuckDB .db file; zero infra dependency |
| Delta Lake | Databricks-native; Delta UniForm for cross-engine reads |
| Lance | ML training data; O(1) random row access; native vector columns; zero-copy mmap→PyTorch |
| Arrow IPC/Flight | In-process zero-copy; Flight SQL replaces JDBC (10–50× throughput) |

Build & Toolchain

# Rust maximum performance
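# LTO, codegen-units and panic are Cargo profile settings; passing -C lto via RUSTFLAGS conflicts with Cargo's bitcode handling
RUSTFLAGS="-C target-cpu=native" \
CARGO_PROFILE_RELEASE_OPT_LEVEL=3 CARGO_PROFILE_RELEASE_LTO=fat \
CARGO_PROFILE_RELEASE_CODEGEN_UNITS=1 CARGO_PROFILE_RELEASE_PANIC=abort \
cargo build --release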
# LLD is default linker on x86_64 Linux since Rust 1.90 (2–4× faster than GNU ld)

# PGO (Profile-Guided Optimization): +10–20%
# 1. Build with instrumentation  2. Run production-like workload  3. Rebuild with profile
cargo pgo build; cargo pgo run; cargo pgo optimize   # cargo-pgo is invoked as a cargo subcommand; automates the workflow

# BOLT post-link (on top of PGO): additional 5–15%
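# via cargo-pgo (instrumentation-based): build, run the instrumented binary on the workload, then optimize
cargo pgo bolt build --with-pgo; cargo pgo bolt optimize --with-pgo
# sampling alternative with raw LLVM tools: perf record -e cycles:u -j any,u -- ./app, then perf2bolt and llvm-bolt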

# GCC/Clang PGO
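-fprofile-generate → run → -fprofile-use     # Clang: merge the .profraw files with llvm-profdata first
-fprofile-partial-training   # GCC only: paths the training run never exercised are not size-optimized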

# LTO (Link-Time Optimization)
-flto   # GCC (-flto=auto parallelizes); Clang: -flto=full or -flto=thin; enables cross-TU inlining

# Java AOT cache (Project Leyden; JDK 24+, one-step creation since JDK 25, ZGC-compatible in Java 26)
java -XX:AOTCacheOutput=app.aot -jar app.jar           # training run (JDK 25+): cache is written on exit
java -XX:AOTCache=app.aot -jar app.jar                 # production

OS Tuning (Linux 7.0 / 6.18 LTS)

Linux 7.0 (Apr 12, 2026; current: 7.0.10 May 23, 2026; 7.1 expected Jul 2026):

  • Rust stable in kernel (first-class drivers)
  • io_uring: non-circular queues (better cache), cBPF filters, IOPOLL completion fix
  • Rebuilt hybrid CPU scheduler: P-cores for latency-sensitive, E-cores for background — automatic, no cgroup config
  • Open Tree Namespace: faster container creation
  • XFS self-healing: runtime metadata repair without unmount
  • Lazy preemption default; Intel TSX enabled on newer chips
  • 7.1: in-kernel NTFS R/W, AMD XDNA v3, ARM SVE2/SME stabilization

Kernel parameters (sysctl):

# sysctl.conf entries (one key per line)
vm.swappiness=1
vm.dirty_ratio=15
vm.dirty_background_ratio=5
net.core.netdev_max_backlog=65536
net.ipv4.ip_local_port_range=1024 65535
net.ipv4.tcp_syncookies=1
net.ipv4.tcp_slow_start_after_idle=0
fs.file-max=1048576
# shell commands, as root (these are not sysctl keys)
echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
echo defer+madvise > /sys/kernel/mm/transparent_hugepage/defrag
ulimit -n 1048576
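
A quick way to confirm the values actually took effect is to read them back from /proc/sys. The following is a minimal sketch, not a definitive tool: the helper names `sysctl_path` and `sysctl_drift` and the `TARGETS` table are hypothetical, and it assumes keys whose path components contain no dots (which holds for the keys listed above).

```python
from pathlib import Path

# Target values copied from the sysctl list above (illustrative subset).
TARGETS = {
    "vm.swappiness": "1",
    "vm.dirty_ratio": "15",
    "vm.dirty_background_ratio": "5",
    "net.core.netdev_max_backlog": "65536",
    "net.ipv4.tcp_syncookies": "1",
    "net.ipv4.tcp_slow_start_after_idle": "0",
}


def sysctl_path(key, root="/proc/sys"):
    """Map a dotted sysctl key to its file under /proc/sys."""
    return Path(root).joinpath(*key.split("."))


def sysctl_drift(targets, root="/proc/sys"):
    """Return {key: (wanted, actual)} for every key whose live value differs."""
    drift = {}
    for key, wanted in targets.items():
        try:
            # Collapse tabs/newlines so multi-value keys compare cleanly.
            actual = " ".join(sysctl_path(key, root).read_text().split())
        except OSError:
            actual = None  # key missing or unreadable on this kernel
        if actual != wanted:
            drift[key] = (wanted, actual)
    return drift


if __name__ == "__main__":
    for key, (wanted, actual) in sysctl_drift(TARGETS).items():
        print(f"{key}: want {wanted}, have {actual}")
```

Running it after boot (or in CI against a golden image) catches settings that were silently overridden by a later drop-in file.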

CPU isolation (latency-critical threads):

# GRUB: isolcpus=4-15 nohz_full=4-15 rcu_nocbs=4-15
taskset -c 4-15 ./app                 # pin to isolated cores
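
Pinning can also be done from inside the process instead of through taskset. This is a small sketch under stated assumptions: `parse_cpu_list` and `pin_current_process` are hypothetical helper names, and `os.sched_setaffinity` is a Linux-only standard-library call.

```python
import os


def parse_cpu_list(spec):
    """Parse a kernel-style CPU list such as "4-15" or "0,2-3" into a set of ints."""
    cpus = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            lo, hi = (int(x) for x in part.split("-", 1))
            if lo > hi:
                raise ValueError(f"bad CPU range: {part!r}")
            cpus.update(range(lo, hi + 1))
        else:
            cpus.add(int(part))
    return cpus


def pin_current_process(spec):
    """Pin the calling process to the CPUs named in spec (Linux only).

    Raises OSError if none of the requested CPUs can be used.
    """
    os.sched_setaffinity(0, parse_cpu_list(spec))
    return os.sched_getaffinity(0)


if __name__ == "__main__":
    print(sorted(parse_cpu_list("4-15")))
```

Calling `pin_current_process("4-15")` at startup has the same effect as the taskset line above for that process and the threads it creates afterwards.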

I/O scheduler: none (NVMe), mq-deadline (SATA SSD), bfq (mixed/HDD).

sched_ext (Linux 6.12+): BPF-programmable scheduler. Write custom scheduling policies in BPF; loaded at runtime without kernel recompile.

PREEMPT_LAZY (default in Linux 7.0): reduces unnecessary preemption while maintaining responsiveness. RT threads: chrt -f 99 ./app.

Huge pages: echo N > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages. THP: madvise mode (don't use always globally — causes latency spikes on anonymous alloc).

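IRQ affinity: pin NIC interrupts to dedicated cores. cat /proc/interrupts | grep eth0 → echo 1 > /proc/irq/N/smp_affinity (the value is a hex CPU bitmask: 1 = CPU 0). Stop or configure irqbalance first, or it may rewrite the mask.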
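
Because smp_affinity takes a hex bitmask rather than a CPU list, it is easy to write the wrong value by hand. A minimal sketch of the conversion follows; `cpus_to_smp_affinity` is a hypothetical helper name, and the comma-separated 32-bit groups follow the format the kernel prints for machines with more than 32 CPUs.

```python
def cpus_to_smp_affinity(cpus):
    """Convert an iterable of CPU numbers to the hex mask used by
    /proc/irq/N/smp_affinity (32-bit groups, most significant first)."""
    mask = 0
    for cpu in cpus:
        if cpu < 0:
            raise ValueError("CPU numbers must be non-negative")
        mask |= 1 << cpu
    if mask == 0:
        raise ValueError("at least one CPU is required")
    groups = []
    while mask:
        groups.append(mask & 0xFFFFFFFF)  # low 32 bits first
        mask >>= 32
    head = f"{groups[-1]:x}"                              # top group, no padding
    tail = [f"{g:08x}" for g in reversed(groups[:-1])]    # lower groups, zero-padded
    return ",".join([head] + tail)


if __name__ == "__main__":
    print(cpus_to_smp_affinity(range(4, 16)))
```

The printed string is what would be echoed into /proc/irq/N/smp_affinity for the chosen IRQ number N.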

Mitigations cost: mitigations=off recovers 2–5% (compute) to 15–30%+ (syscall-heavy: DBs, proxies) on trusted-code-only hosts. Never on multi-tenant/untrusted-guest machines. Audit: lscpu | grep -A20 Vulnerab.

resctrl (Intel RDT / AMD QoS): partition shared L3 + throttle memory bandwidth per cgroup — kills noisy-neighbor p99 spikes. mount -t resctrl resctrl /sys/fs/resctrl; write way-masks to schemata (L3:0=ff0), throttle batch to MB:0=30; monitor llc_occupancy/mbm_total_bytes.
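
The way-mask written to schemata is a hex bitmask of L3 ways, and on Intel CAT the set bits have to be contiguous. A hedged sketch of building such a mask is below; `l3_way_mask` and `schemata_lines` are hypothetical helper names, and the total way count is hardware-specific (it can be read from the width of /sys/fs/resctrl/info/L3/cbm_mask).

```python
def l3_way_mask(first_way, n_ways, total_ways):
    """Hex capacity bitmask covering n_ways contiguous ways starting at first_way."""
    if n_ways < 1 or first_way < 0 or first_way + n_ways > total_ways:
        raise ValueError("requested ways do not fit in the cache")
    return format(((1 << n_ways) - 1) << first_way, "x")


def schemata_lines(cache_id, mask_hex, mb_percent=None):
    """Lines to write to a resctrl group's schemata file, one write per line."""
    lines = [f"L3:{cache_id}={mask_hex}"]
    if mb_percent is not None:
        lines.append(f"MB:{cache_id}={mb_percent}")  # memory bandwidth throttle
    return lines


if __name__ == "__main__":
    # 8 ways starting at way 4 on a 12-way L3, batch group throttled to 30%.
    print(schemata_lines(0, l3_way_mask(4, 8, 12), mb_percent=30))
```

Giving the latency-critical group and the batch group non-overlapping masks is what prevents the batch job from evicting the hot working set.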

MGLRU (6.1+): generational page reclaim — better working-set detection, fewer refaults under pressure. echo y > /sys/kernel/mm/lru_gen/enabled. DAMON: kernel access-frequency monitoring + actions (proactive reclaim, hot-page THP collapse, CXL tier promotion) via damo.

6.x LTS reference: 6.12 (LTS, Dec 2026 EOL): sched_ext stable, PREEMPT_RT mainlined. 6.18 (LTS, Dec 2027 EOL): recommended server LTS. 6.19: last 6.x release.


Third-Party Library Versions (June 2026)

Allocators

  • mimalloc v3.3.2 (Jan 2026) · jemalloc 5.3.1 · Abseil TCMalloc (google/tcmalloc) · gperftools 2.17

Serialization

  • rkyv 0.8 (zero-copy Rust) · Cap'n Proto 1.0 · FlatBuffers 24.12 · protobuf 29.x · Apache Arrow 20.0

Hashing / Collections

  • Google Highway 1.2 · xxHash3 0.8.3 · wyhash 4.2 · Abseil flat_hash_map (2025-01) · RoaringBitmap 1.3 · DashMap 6.1 (Rust)

Concurrency / Networking

  • Tokio 1.51 LTS (MSRV 1.71; until Mar 2027) · Axum 0.8.x · quic-go 0.49 (UDP GSO/GRO) · Netty 4.2 · LMAX Disruptor 4.0 · Aeron 1.47

Compression

  • Zstd 1.5.7 (Feb 2026; dict 4.9× faster) · LZ4 1.10.0 · zlib-ng 2.2.4 (2–3× faster) · Brotli 1.1.0

Databases / Storage

  • TidesDB v9.0.8 · BadgerDB 4.7.0 · DuckDB 1.3 · Polars 1.24 · SQLite 3.46

Profiling

  • async-profiler 3.0 (Java) · Intel VTune 2025.0 · bpftrace 0.22 · Parca 0.19 · clinic.js 12.0 (Node.js) · perf (Linux 7.0)

Profiling Workflow

  1. CPU: perf record -g -F 999 -- ./app → perf script | stackcollapse-perf.pl | flamegraph.pl > flame.svg. Widest towers = hottest.
  2. Cache: perf stat -e cache-misses,L1-dcache-load-misses,LLC-load-misses ./app. LLC miss > 1% is significant.
  3. Memory: heaptrack (C++), async-profiler -e alloc (Java), pprof heap (Go), memray (Python).
  4. Locks: async-profiler -e lock (Java), pprof mutex (Go), perf lock (Linux).
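  5. I/O: iostat -x 1 and watch await, %util. Per-request latency histogram (µs): bpftrace -e 'tracepoint:block:block_rq_issue { @start[args->dev, args->sector] = nsecs; } tracepoint:block:block_rq_complete /@start[args->dev, args->sector]/ { @lat_us = hist((nsecs - @start[args->dev, args->sector]) / 1000); delete(@start[args->dev, args->sector]); }'. Note that hist(args->nr_sector) on block_rq_complete yields a request-size histogram, not latency.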
  6. Network: ss -s, netstat -s | grep retransmit, ethtool -S eth0 | grep error.

Benchmark discipline: 30+ runs, p50/p95/p99/p999, never mean-only. Coordinated omission: closed-loop generators (wrk, ab) under-report p99 by 10–1000× during server stalls — use open-loop wrk2 -R, Gatling constant-rate, or k6 constant-arrival-rate + HdrHistogram corrected recording. Environment: governor=performance, turbo off, ASLR off, taskset+numactl pinned. Warm JVM ≥ 100K iterations. Flame graph before any optimization.
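
To see why coordinated omission matters, here is a small sketch in the spirit of HdrHistogram's expected-interval correction: when one request stalls, the requests that should have been sent during the stall are back-filled with the latency they would have observed. The function names and the sample data are illustrative assumptions, not part of any library.

```python
import math


def correct_coordinated_omission(latencies_us, expected_interval_us):
    """Back-fill samples a closed-loop generator failed to send while stalled."""
    corrected = []
    for value in latencies_us:
        corrected.append(value)
        missing = value - expected_interval_us
        while missing >= expected_interval_us:
            corrected.append(missing)          # request that was never issued
            missing -= expected_interval_us
    return corrected


def percentile(samples, p):
    """Nearest-rank percentile (p in 0..100) of a non-empty sample list."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


if __name__ == "__main__":
    # 99 fast requests (1 ms) and one 100 ms stall, target rate 1 request/ms.
    raw = [1_000] * 99 + [100_000]
    fixed = correct_coordinated_omission(raw, expected_interval_us=1_000)
    print("raw p99 (us):", percentile(raw, 99))
    print("corrected p99 (us):", percentile(fixed, 99))
```

With this input the raw p99 still looks like 1 ms, while the corrected p99 reflects the stall, which is the gap an open-loop generator would have measured directly.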


Security Without Overhead

  • TLS 1.3 only (ssl_protocols TLSv1.3); session tickets for resumption; OCSP stapling.
  • AEAD ciphers: AES-256-GCM (hardware-accelerated via AES-NI/VAES, negligible CPU overhead) or ChaCha20-Poly1305 (fast in software; prefer it on CPUs without AES instructions).
  • Memory-safe defaults: bounds-checked slices, no raw pointer arithmetic without explicit unsafe.
  • scram-sha-256 for DB auth (MariaDB 12.3: caching_sha2_password).
  • Input validation at boundaries: check length/type/range at ingestion; don't repeat per function.
  • O_CLOEXEC on file descriptors; close extra fds before exec().
  • Separate processes/namespaces for secret handling; don't log secrets or tokens.

SIMD Search & Sort Algorithms

SIMD Binary Search — Hybrid Eytzinger + SIMD Terminal Scan

Standard binary search: cache-hostile. On a 4 MB array the first ~8 probes each land on a cache line far from the previous one, so each misses L1/L2 and pays roughly 40 ns (~320 ns total).

Eytzinger layout: store the sorted keys in BFS (level) order, 1-indexed, so the children of node k sit at 2k and 2k+1 (k descends 1→2→4→...). With 4-byte keys the top 4 levels (15 keys) fit in a single 64-byte cache line. Search: k = 2*k + (a[k] < x); __builtin_prefetch(a + k*16, 0, 0). About 40% faster than binary search for large N.
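
The layout and the branch-free descent can be sketched in a few lines. This is an illustrative scalar version without prefetching; `build_eytzinger` and `lower_bound` are hypothetical names, and the final shift mirrors the usual `k >>= __builtin_ffs(~k)` step that undoes the trailing right turns.

```python
def build_eytzinger(sorted_vals):
    """Return a 1-indexed BFS-ordered copy of sorted_vals (slot 0 is unused)."""
    n = len(sorted_vals)
    out = [None] * (n + 1)
    it = iter(sorted_vals)

    def fill(k):
        # In-order walk of the implicit tree assigns keys in sorted order.
        if k <= n:
            fill(2 * k)
            out[k] = next(it)
            fill(2 * k + 1)

    fill(1)
    return out


def lower_bound(eyt, x):
    """Index in eyt of the first key >= x, or 0 if every key is smaller."""
    n = len(eyt) - 1
    k = 1
    while k <= n:
        k = 2 * k + (eyt[k] < x)       # go right when the key is too small
    # Drop the trailing 1-bits (right turns) plus the 0-bit before them.
    k >>= (~k & (k + 1)).bit_length()
    return k


if __name__ == "__main__":
    eyt = build_eytzinger([1, 3, 5, 7, 9, 11, 13])
    print(eyt[1:])                      # BFS order of the keys
    print(eyt[lower_bound(eyt, 6)])     # first key >= 6
```

In C the same loop compiles to a conditional move per level, and the prefetch of a + k*16 pulls in the cache line holding the descendants four levels down.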

SIMD terminal scan (for final 32 elements): load 4×__m256i; _mm256_cmpgt_epi32 + _mm256_movemask_epi8; __builtin_popcount(mask)/4 = lower_bound index. 4 SIMD loads + 4 compares + popcount = ~2–4 cycles total.
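
The mask arithmetic is easier to follow in a scalar emulation. The sketch below is not SIMD code: `cmpgt_movemask` is a hypothetical stand-in for `_mm256_movemask_epi8(_mm256_cmpgt_epi32(set1(x), v))`, producing 4 mask bits per 32-bit lane, so the popcount divided by 4 counts the lanes smaller than x.

```python
LANES = 8  # int32 lanes in one 256-bit register


def cmpgt_movemask(x, lanes):
    """Emulate movemask_epi8 over a per-lane x > lane compare (4 bits per lane)."""
    mask = 0
    for i, v in enumerate(lanes):
        if x > v:
            mask |= 0xF << (4 * i)
    return mask


def terminal_scan(block, x):
    """lower_bound index of x inside a sorted block of exactly 32 keys."""
    if len(block) != 32:
        raise ValueError("block must hold exactly 32 keys")
    count = 0
    for start in range(0, 32, LANES):            # four emulated register loads
        mask = cmpgt_movemask(x, block[start:start + LANES])
        count += bin(mask).count("1") // 4       # popcount(mask) / 4
    return count


if __name__ == "__main__":
    block = list(range(0, 64, 2))                # 0, 2, 4, ..., 62
    print(terminal_scan(block, 7))
```

Because the block is sorted, the number of keys smaller than x is exactly the lower_bound position, which is why no branches are needed.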

Hybrid: Eytzinger descent (N → ~32) + SIMD linear scan = ~65% faster than std::lower_bound for N ∈ [512, 10K].

AVX-512 terminal scan: __mmask16 m = _mm512_cmplt_epi32_mask(v, vt); int pos = __builtin_popcount(m) — 16 elements per instruction, zero branches.

SIMD Sort Algorithms

| Algorithm | Library | When | Gain |
|---|---|---|---|
| pdqsort (pattern-defeating quicksort) | Rust slice::sort_unstable, C++ std::ranges::sort | Default for any comparison sort | Already in stdlib; branchless 64-element partition |
| vqsort (Google Highway) | hwy::VQSort(arr, n, hwy::SortAscending()) | N > 1K; int32/int64/f32/f64 | 3–8× vs std::sort; AVX-512 compress/expand for partitioning |
| SIMD radix sort (LSD, 8-bit digits) | ska_sort (C++), ips4o (parallel), rdxsort (Rust) | Integer/float keys N > 100K | O(N) theoretical; SIMD histogram build |
| Sort networks (fixed N) | Manual intrinsics | N = 4/8/16/32 (e.g., sorting SIMD registers, tuple keys) | 8 f32 in ~6 cycles (vs ~40 cycles scalar insertion sort) |
| Cache-oblivious merge | Block size = L1/2 × element_size | Large arrays exceeding L2 | Bottom-up; merge on L1-resident subarrays |
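
As an illustration of the sort-network row, here is a minimal sketch of the 5-comparator network for N = 4. The comparator sequence is fixed regardless of the data, which is what lets an intrinsics version replace each compare-exchange with a min/max pair; the names `NETWORK_4` and `sort4` are hypothetical.

```python
# Comparator pairs of the 5-comparator sorting network for 4 elements.
NETWORK_4 = [(0, 1), (2, 3), (0, 2), (1, 3), (1, 2)]


def sort4(values):
    """Sort exactly four comparable values with a fixed compare-exchange sequence."""
    if len(values) != 4:
        raise ValueError("sort4 expects exactly 4 values")
    v = list(values)
    for i, j in NETWORK_4:
        # Compare-exchange: data-independent, maps to min/max in SIMD code.
        v[i], v[j] = min(v[i], v[j]), max(v[i], v[j])
    return v


if __name__ == "__main__":
    print(sort4([3, 1, 4, 2]))
```

The same idea scales to 8, 16 or 32 elements with longer comparator lists, usually generated rather than written by hand.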

NUMA & CXL Memory

  • NUMA allocation: numactl --membind=N --cpunodebind=N ./app. Linux CXL memory appears as additional NUMA node.
  • mbind(addr, len, MPOL_BIND, &cxl_nodemask, ...) for CXL tier placement. MPOL_PREFERRED_MANY for DDR5-first, CXL fallback.
  • MemTis / TPP (Transparent Page Placement): auto-migrate pages DDR5↔CXL by access frequency.
  • CXL latency: ~120–200 ns (vs ~50–80 ns DDR5). Use for warm/cold data; keep hot data in local DDR5.
  • CXL 3.0: shared memory pool across multiple hosts via CXL switch — relevant for distributed caches.
  • Tools: lscpu | grep NUMA, numastat, numactl --hardware, ls /sys/bus/cxl/devices/.

Checklist Before Shipping

  • EXPLAIN (ANALYZE, BUFFERS) on every production query.
  • Index strategy reviewed: B-tree vs partial vs GIN; partition pruning verified.
  • Heap/GC configured: -Xms=Xmx, GC type chosen per workload, logging enabled.
  • Allocator chosen: mimalloc v3 / jemalloc; 1/4 RAM pre-warm in place.
  • io_uring enabled where applicable: PG18 io_method=io_uring, server I/O paths.
  • TLS: PSK resumption, OCSP stapling, distributed KEK rotation.
  • Compression: Zstd dictionary trained on representative samples for payloads < 4 KB.
  • Hot paths profiled: flame graph shows expected top functions, not surprises.
  • False sharing audit: shared counters/atomics padded to 64 bytes.
  • SIMD opportunities: inner loops auto-vectorizing? Check with -fopt-info-vec (GCC) or -Rpass=loop-vectorize (Clang).
  • PGO applied to production binary (10–20% gain for free).
  • MALLOC_CONF / mimalloc env vars set for allocator.
  • sysctl applied: vm.swappiness=1, tcp_slow_start_after_idle=0, THP=madvise.
  • isolcpus set for latency-critical threads.
  • Benchmarks: 30 runs, p99 ≤ SLA, regression CI in place.
