June 2026 · Java 25/26 · Go 1.26 · Rust 1.94 · Python 3.14 · Node 24 · Linux 7.0/6.18 LTS · PG 18.4 · MariaDB 12.3.2 LTS
| Operation | Latency |
|---|---|
| L1 hit | ~1 ns |
| L2 hit | ~4 ns |
| L3 hit | ~10–40 ns |
| DDR5 DRAM | ~50–80 ns |
| HBM3e (on-package; latency ~20% HIGHER than DDR5 — advantage is bandwidth 1.2 TB/s/stack) | ~100–150 ns |
| CXL 2.0/3.0 over PCIe | ~120–200 ns |
| NVMe PCIe 5 seq read | ~50–100 µs |
| NVMe 4K random read | ~100–200 µs |
| Datacenter TCP RTT | ~500 µs |
| Mutex uncontended | ~20–40 ns |
| Mutex contended | ~200–1000 ns |
| Context switch | ~1–10 µs |
| syscall (ring 0) | ~100–1000 ns |
| vDSO (clock_gettime) | ~5–15 ns |
| CAS uncontended | ~5–10 ns |
| CAS cross-core | ~50–500 ns |
| AVX-512 op 16×f32 (Zen 5) | ~0.5–1 ns |
| GPU kernel launch | ~5–10 µs |
| Branch misprediction | ~10–20 cycles |
| TLB miss 4 KB page | ~100–1000 cycles |
| TLB miss 2 MB page | ~20–100 cycles |
Rule: the L1 → DRAM gap is roughly 50–80× (~1 ns vs ~50–80 ns in the table above). Design data layout around this.
- Smallest correct primitive: use `uint8/16/int32` not `int64/float64` when domain permits; packs more per cache line.
- Intern strings at ingestion; replace comparisons with integer IDs in hot paths.
- No `interface{}`/`any`/`Object` in hot paths: fat pointer + no inlining.
- Fixed-point (`int32 × 1000`) for bounded monetary/sensor values; avoids FPU stalls.
- Pack booleans: `uint8` bitfield for 8 flags vs 8 `bool` fields = 7 bytes saved.
- Cache line = 64 bytes on x86/ARM/RISC-V. Struct field order: largest → smallest, eliminates compiler padding.
- False sharing: pad independently-written variables to 64 bytes (`alignas(64)`, `#[repr(align(64))]`).
- SoA over AoS when a loop touches only a subset of fields; packs relevant data contiguously.
- AoSoA (blocks of 4–8 = SIMD width) for SIMD-heavy math kernels.
- SIMD arrays: 32-byte aligned (AVX2), 64-byte aligned (AVX-512). `posix_memalign`, `aligned_alloc`, `#[repr(align(64))]`.
- Zero-copy parsing: store `(offset, length)` pairs into raw buffer. Only copy when ownership transfers across threads.
- `O_DIRECT`/`mmap` buffers: align to 4096 bytes.
- Prefetch: `__builtin_prefetch(addr, 0, 3)` 100–300 ns ahead of pointer-chasing loops.
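To make the field-ordering rule concrete, here is a minimal sketch that uses Python's `ctypes` to mimic C struct layout. The struct names and fields are hypothetical. The sizes assume a typical 64-bit ABI where a `uint64` is 8-byte aligned, in which case the interleaved layout usually comes out at 24 bytes and the sorted one at 16.

```python
import ctypes


class Unordered(ctypes.Structure):
    # Hypothetical record: small and large fields interleaved,
    # so the compiler-style layout inserts padding before each wide field.
    _fields_ = [
        ("flag_a", ctypes.c_uint8),
        ("timestamp", ctypes.c_uint64),
        ("flag_b", ctypes.c_uint8),
        ("count", ctypes.c_uint32),
    ]


class Ordered(ctypes.Structure):
    # Same fields, declared largest to smallest.
    _fields_ = [
        ("timestamp", ctypes.c_uint64),
        ("count", ctypes.c_uint32),
        ("flag_a", ctypes.c_uint8),
        ("flag_b", ctypes.c_uint8),
    ]


def records_per_cache_line(struct_type, line_bytes=64):
    """How many whole records fit in one cache line of `line_bytes`."""
    return line_bytes // ctypes.sizeof(struct_type)


if __name__ == "__main__":
    for s in (Unordered, Ordered):
        print(s.__name__, ctypes.sizeof(s), records_per_cache_line(s))
```

The same comparison in C or Rust can be made with `sizeof` or `std::mem::size_of`; the Python version only demonstrates the layout arithmetic.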
| Variant | Internals | Use When | Downside |
|---|---|---|---|
| Swiss Tables (open addr) | Flat array + 1-byte H2 metadata; SIMD `_mm_cmpeq_epi8` checks 16 slots | Short keys < 32B; Go 1.24+ built-in; C++ Abseil `flat_hash_map` | Max load factor 7/8 (87.5%); resize = O(N) copy spike |
| Chaining | Array of linked-list bucket heads | High load factor > 90%; variable-size keys | Pointer chase = cache miss per element; alloc-heavy |
| Perfect hash | Offline-computed MPH; 2–3 ns/key; zero collisions | Static key sets (HTTP headers, opcodes) | Cannot add keys post-build |
| Sharded map | N independent shards; `shard = hash(key) & (N-1)` | Concurrent writes; reduce lock contention by factor N | Cross-shard ops require N locks; hot key still contends |
Hash functions: integer keys → wyhash/xxHash3; strings < 64B → wyhash; long strings → xxHash3 SIMD; untrusted keys → SipHash-1-3.
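The sharded-map row above can be sketched as follows. This is an illustrative outline, not a tuned implementation: `ShardedMap` and its methods are hypothetical names, Python's built-in `hash` stands in for wyhash/xxHash3, and the shard count is assumed to be a power of two so that the mask works.

```python
import threading


class ShardedMap:
    def __init__(self, shard_count=16):
        # The `& (N-1)` trick only selects a valid shard when N is a power of two.
        if shard_count <= 0 or shard_count & (shard_count - 1):
            raise ValueError("shard_count must be a power of two")
        self._mask = shard_count - 1
        self._shards = [({}, threading.Lock()) for _ in range(shard_count)]

    def _shard(self, key):
        return self._shards[hash(key) & self._mask]

    def put(self, key, value):
        data, lock = self._shard(key)
        with lock:  # only this shard is locked; other shards stay available
            data[key] = value

    def get(self, key, default=None):
        data, lock = self._shard(key)
        with lock:
            return data.get(key, default)

    def __len__(self):
        # Cross-shard operation: visits every lock in turn, so it is slow
        # and not an atomic snapshot (the downside noted in the table).
        total = 0
        for data, lock in self._shards:
            with lock:
                total += len(data)
        return total
```

A hot key still maps to a single shard, so this spreads contention across keys but does not remove contention on one key.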
- Dynamic array / `Vec`: `(ptr, len, cap)` triple; O(1) append within cap; resize = 2× copy. Pre-allocate: `make([]T, 0, cap)` / `Vec::with_capacity(n)` / `new ArrayList<>(n)`.
- Ring buffer: power-of-two size; `head & (size-1)` for wrap. Allocation-free FIFO. Use for SPSC I/O/event pipelines.
- Bag (multiset): `{element → count}` hash map or sorted array + binary search.
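A minimal sketch of the power-of-two ring buffer described above. It is single-threaded and only shows the index arithmetic; a real SPSC queue would also need acquire/release ordering on the counters. The class name and the choice to return `None` on empty are assumptions for illustration.

```python
class RingBuffer:
    def __init__(self, size):
        if size <= 0 or size & (size - 1):
            raise ValueError("size must be a power of two")
        self._buf = [None] * size  # allocated once, slots are reused
        self._mask = size - 1
        self._head = 0  # count of items popped so far
        self._tail = 0  # count of items pushed so far

    def push(self, item):
        if self._tail - self._head == len(self._buf):
            return False  # full: caller decides whether to drop or retry
        self._buf[self._tail & self._mask] = item  # mask replaces modulo
        self._tail += 1
        return True

    def pop(self):
        if self._head == self._tail:
            return None  # empty
        item = self._buf[self._head & self._mask]
        self._head += 1
        return item
```

Keeping `head` and `tail` as ever-increasing counters and masking only on access makes the full and empty checks unambiguous without wasting a slot.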
| Structure | Internals | Use When | Downside |
|---|---|---|---|
| SPSC queue | Ring buffer; head/tail on separate cache lines; acquire/release only | One producer, one consumer; 10–30 ns/op | Strictly single-producer, single-consumer |
| Disruptor | Pre-allocated ring; sequence numbers; CAS claim; wait strategies (busy-spin < yield < park) | MPMC; tens of millions msg/sec; sub-100 ns | Fixed ring size; slow consumer blocks fast producer |
| Sharded | N independent instances; lock per shard | Concurrent maps/caches/rate limiters | Global iteration requires N locks |
Memory ordering cost: x86 (TSO): acquire/release free; ARM/RISC-V: explicit barriers ~10–20 ns. Use weakest correct ordering.
| Type | Internals | Use When | Downside |
|---|---|---|---|
| B+Tree | Multi-key cache-line nodes; leaves linked | Sorted in-memory maps N > 1000; all DB storage engines | Write amplification on node splits |
| ART | Node4/16/48/256 by child count; SIMD byte-compare; path compression | IP routing, prefix enum, sorted string maps | 8+ bytes/node overhead; complex concurrent impl |
| Van Emde Boas layout | Recursive cache-oblivious layout; each half-height subtree stored in contiguous memory | Static sorted tables queried many times | Read-only after construction; unfamiliar API |
| Structure | Use When | Key Detail |
|---|---|---|
| Bloom filter (blocked) | Pre-filter before expensive op; 10 bits/elem = 1% FPR | Blocked variant: 1 cache line per query |
| Cuckoo filter | Bloom + deletion support | 2 bucket reads; load factor > 95% risks insert failure |
| Skip list | Sorted concurrent access; lock-free per-node CAS | O(log N); ~O(log N) pointer overhead vs B-tree |
| CSR graph | BFS/DFS/Dijkstra/PageRank; cache-optimal traversal | `offsets[V+1]` + `edges[E]` flat arrays |
| Columnar (Arrow) | Analytics: filter/sum/group touching few columns | All values of column N contiguous |
| Disruptor event bus | In-process pub/sub; millions events/sec | Pre-alloc ring + per-subscriber sequence counter |
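The CSR row can be illustrated with a short sketch: build the `offsets`/`edges` arrays from an edge list, then run BFS over them. Function names are hypothetical, and plain Python lists stand in for the flat integer arrays a real implementation would use.

```python
from collections import deque


def build_csr(num_vertices, edge_list):
    """Return (offsets, edges) for a directed graph given (src, dst) pairs."""
    offsets = [0] * (num_vertices + 1)
    for src, _ in edge_list:
        offsets[src + 1] += 1  # count out-degree of each vertex
    for v in range(num_vertices):
        offsets[v + 1] += offsets[v]  # prefix sum: offsets[v] = start of v's edges
    edges = [0] * len(edge_list)
    cursor = offsets[:-1]  # slicing copies: next free slot per vertex
    for src, dst in edge_list:
        edges[cursor[src]] = dst
        cursor[src] += 1
    return offsets, edges


def bfs_order(offsets, edges, start):
    """Visit order of a breadth-first search starting at `start`."""
    seen = [False] * (len(offsets) - 1)
    seen[start] = True
    order, queue = [], deque([start])
    while queue:
        v = queue.popleft()
        order.append(v)
        # Neighbors of v are one contiguous slice: no pointer chasing.
        for i in range(offsets[v], offsets[v + 1]):
            w = edges[i]
            if not seen[w]:
                seen[w] = True
                queue.append(w)
    return order
```

For the four-vertex graph `[(0, 1), (0, 2), (1, 3), (2, 3)]`, `build_csr` yields `offsets = [0, 2, 3, 4, 4]` and `edges = [1, 2, 3, 3]`.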
- GVN/CSE: hoist loop-invariant loads into const-locals before loop. `for(int i=0; i<obj->len; i++)` → `const int len = obj->len; const T* d = obj->data;` gives one load vs N.
- Loop unswitching: if a branch predicate is stable across all iterations, manually split into two loops. Each inner loop is clean and vectorizable.
- Cold-path outlining: `__attribute__((noinline, cold))` on all error handlers, assertions, rare branches. Keeps hot basic blocks dense in i-cache.
- SROA: prefer flat local variables and structs that don't escape a function; the compiler promotes them to registers.
- Inlining: `__attribute__((always_inline))` / `#[inline(always)]` for hot ≤ 20-instruction functions. `#[cold] #[inline(never)]` for error paths.
- Loop unrolling: manual for inner loops < 8 iterations. `#pragma GCC unroll N` in C/C++; in Rust, gate feature-specific unrolled variants with `#[cfg(target_feature = "avx2")]`.
- Loop tiling: for 2D loops on matrices, tile to L1/L2 cache (tile size = `sqrt(cache_size / element_size)`).
- Avoid virtual dispatch in hot paths: interface methods prevent inlining; use concrete types + monomorphization (Rust generics, C++ templates).
- Devirtualization hint: call-site type hint or PGO feedback causes compiler to inline virtual calls.
- Auto-vectorization triggers: unit-stride access, no pointer aliasing (`restrict`/`noalias`), loop counter known, no function calls inside loop, no breaks/continues.
- LICM: if value doesn't change per iteration, it must not be inside the loop.
- SLP vectorization: same operation on independent scalar chains → compiler packs into SIMD. Enable with `-O3 -march=native`.
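As a small worked example of the loop-tiling rule above, the sketch below computes a tile side from a cache budget and shows the blocked loop structure on a transpose. The 32 KB L1 figure and the function names are assumptions. Python gains nothing from tiling, so this only illustrates the arithmetic and loop shape that would be written in C or Rust.

```python
import math


def tile_side(cache_bytes, element_bytes):
    """Largest square tile side (in elements) that fits the cache budget."""
    return math.isqrt(cache_bytes // element_bytes)


def tiled_transpose(matrix, tile):
    """Transpose a square list-of-lists matrix one tile at a time."""
    n = len(matrix)
    out = [[0] * n for _ in range(n)]
    for i0 in range(0, n, tile):          # tile row origin
        for j0 in range(0, n, tile):      # tile column origin
            for i in range(i0, min(i0 + tile, n)):
                for j in range(j0, min(j0 + tile, n)):
                    out[j][i] = matrix[i][j]
    return out


if __name__ == "__main__":
    # Assumed 32 KB L1 data cache and 4-byte floats.
    print(tile_side(32 * 1024, 4))
```

When a kernel touches several arrays at once (for example the three matrices of a matmul), the budget passed to `tile_side` would need to be divided among them.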
jemalloc 5.3.1
MALLOC_CONF="narenas:$(nproc),tcache:true,tcache_max:32768,lg_tcache_nslots_mul:2,\
background_thread:true,dirty_decay_ms:5000,muzzy_decay_ms:10000,\
oversize_threshold:8388608,retain:true,metadata_thp:auto,thp:always"
Pre-warm: mallocx(RAM/4, MALLOCX_ARENA(0)|MALLOCX_TCACHE_NONE) → memset → dallocx. With retain:true, freed extents stay in address space.
tcmalloc / gperftools 2.17
TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=$((512*1024*1024)) # 512 MB
TCMALLOC_RELEASE_RATE=0.1
Abseil TCMalloc programmatic: SetMaxTotalThreadCacheBytes(512<<20), SetBackgroundReleaseRate(0). Call MarkThreadIdle()/MarkThreadBusy() for pool threads.
mimalloc v3.3.2 (recommended default 2026)
MIMALLOC_RESERVE_HUGE_OS_PAGES=$((RAM_MB/2048)) # 1 GB pages
MIMALLOC_ALLOW_THP=1
MIMALLOC_PAGE_RESET=0 # no page reset on free (faster short-lived alloc reuse)
MIMALLOC_DECOMMIT_DELAY=500 # 500 ms before OS decommit
MIMALLOC_ARENA_EAGER_COMMIT=1
v3 per-request heap: mi_heap_new() → mi_heap_malloc() → mi_heap_destroy() (bulk free).
Choosing: mimalloc v3 → new projects; jemalloc → lowest p99 latency + mixed tiny/large; Abseil TCMalloc → extreme multi-thread throughput; jemalloc/mimalloc → TidesDB/RocksDB (crashes with glibc).
- Per-request scratch arena: 64 KB blocks from pool; all allocs from bump pointer; single free at request end.
- Object pools: pre-create N at startup; lock-free SPSC/MPMC free list; monitor high-water mark.
- Pool objects: ByteBuffers, connection objects, parser instances, ZSTD_CCtx/ZSTD_DCtx.
- Off-heap for JVM: `MemorySegment` (JDK 22+ Panama, stable); `Arena.ofConfined()` for scoped lifecycle.
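The object-pool bullet above can be sketched like this. It is a single-threaded outline with hypothetical names; the lock-free free list mentioned in the text is not shown, and returning `None` on exhaustion is just one possible policy.

```python
class ObjectPool:
    def __init__(self, factory, size):
        # Pre-create every object at startup so the hot path never allocates.
        self._free = [factory() for _ in range(size)]
        self._in_use = 0
        self.high_water_mark = 0  # peak concurrent usage, for capacity planning

    def acquire(self):
        if not self._free:
            return None  # exhausted: caller decides how to degrade
        obj = self._free.pop()
        self._in_use += 1
        self.high_water_mark = max(self.high_water_mark, self._in_use)
        return obj

    def release(self, obj):
        self._in_use -= 1
        self._free.append(obj)  # most recently released object is reused first
```

Usage would look like `pool = ObjectPool(lambda: bytearray(4096), 1024)`. A high-water mark close to the pool size suggests the pool is undersized.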
G1GC (default, general purpose)
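-XX:+UseG1GC -Xms<size> -Xmx<size> -XX:MaxGCPauseMillis=100   # set -Xms equal to -Xmx
-XX:G1HeapRegionSize=16m # heap/2048; power of 2; 1–32 MB chosen ergonomically (up to 512 MB if set explicitly, JDK 18+)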
-XX:G1NewSizePercent=20 -XX:G1MaxNewSizePercent=40
-XX:InitiatingHeapOccupancyPercent=40 # start concurrent mark earlier
-XX:G1ReservePercent=15
-XX:+G1UseAdaptiveIHOP
-XX:+AlwaysPreTouch -XX:+UseNUMA -XX:+UseTransparentHugePages
Java 25: remembered set merge → 2 GB→0.75 GB on 64 GB heap. Java 26 JEP 522: 15% throughput gain from reduced sync overhead.
ZGC (Java 25 = Generational only; -XX:+ZGenerational is no-op now)
-XX:+UseZGC -Xms<size> -Xmx<size>   # set -Xms equal to -Xmx
-XX:SoftMaxHeapSize=<0.75*Xmx> # normal-load cap; burst to Xmx
-XX:ZAllocationSpikeTolerance=2
-XX:ZUncommitDelay=300
-XX:ConcGCThreads=4
-XX:AOTCache=app.aot # Java 26+: ZGC + AOT startup compatible
-XX:+AlwaysPreTouch -XX:+UseNUMA -XX:+UseTransparentHugePages
Java 25 Mapped Cache: replaces Page Cache, fixes inflated RSS, reduces fragmentation. ~5–10% throughput overhead vs G1. Needs ~10–15% more total memory.
Decision: G1 for REST APIs / web services (P99 < 100 ms); ZGC for latency-critical (P99 < 10 ms) or caches (16–200 GB); G1 relaxed for batch/ETL; ZGC for HFT + NUMA pin. Shenandoah and Parallel GC: still in JDK but not recommended for new deployments.
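JFR always-on: `jcmd <pid> JFR.start duration=60s filename=app.jfr settings=profile`. Java 25 JEP 509 (experimental, Linux): CPU-time profiling via the `jdk.CPUTimeSample` event, e.g. `-XX:StartFlightRecording:jdk.CPUTimeSample#enabled=true,filename=app.jfr`.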
# Heap structure: New Space (Nursery + To-Space) | Old Space | Code Space | Large Object
# --max-old-space-size limits ONLY JS heap, not Buffers/native/Code Space → RSS always higher
# High-throughput API (many short-lived objects):
node --max-old-space-size=2048 --max-semi-space-size=128 --max-code-cache-size=256 app.js
# Containerized (512 MB pod):
node --max-old-space-size-percentage=70 app.js # Node 22+; adapts to any container size
# Data processing (large objects, small churn):
node --max-old-space-size=8192 --max-semi-space-size=32 app.js
Semi-space sizing: = (expected_alloc_per_request × 2 × concurrency) / 4. Larger semi-space → fewer Scavenges → less premature promotion. Monitor: v8.getHeapStatistics().used_heap_size / heap_size_limit > 0.85 → OOM risk.
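The sizing rule and the 0.85 threshold above are plain arithmetic, so a short worked sketch may help. The function names and the example workload (512 KiB allocated per request, 256 concurrent requests) are assumptions, not measurements.

```python
def semi_space_mib(alloc_per_request_bytes, concurrency):
    """Semi-space size in MiB: (alloc_per_request * 2 * concurrency) / 4."""
    size_bytes = alloc_per_request_bytes * 2 * concurrency / 4
    return size_bytes / (1024 * 1024)


def oom_risk(used_heap_size, heap_size_limit, threshold=0.85):
    """True when heap usage exceeds the threshold fraction of the limit."""
    return used_heap_size / heap_size_limit > threshold


if __name__ == "__main__":
    # 512 KiB per request at 256 concurrent requests -> 64 MiB,
    # which would map to --max-semi-space-size=64.
    print(semi_space_mib(512 * 1024, 256))
```

In a Node process the two inputs to `oom_risk` would come from `v8.getHeapStatistics()`, as noted in the text.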
⚠ io_uring is blocked by default by the seccomp profiles in Docker ≥ 25 / containerd ≥ 2.0, and across Google's fleet. Ship an epoll fallback; in containers, allowlist `io_uring_setup`/`io_uring_enter`/`io_uring_register` in a custom seccomp profile (never `seccomp=unconfined`).
- Non-circular queue (Linux 7.0): better cache perf for IOPOLL; fixes mixed-device completion deferral.
- cBPF filters (Linux 7.0): per-ring operation allow/deny in containers.
- `IORING_SETUP_SQPOLL`: kernel polling thread → zero syscalls at sustained I/O.
- Multishot accept/recv (`IORING_ACCEPT_MULTISHOT` on `IORING_OP_ACCEPT`, `IORING_RECV_MULTISHOT` on `IORING_OP_RECV`): one SQE handles all connections/data arrivals.
- `IORING_OP_SEND_ZC` (6.0+): zero-copy send; app notified when buffer safe to reuse.
- `io_uring_register_buf_ring()` (5.19+): pre-registered buffers; kernel selects free buffer per recv.
- FUSE over io_uring (6.14+): 20–40% FUSE latency reduction.
- Ring resize (6.13+): `io_uring_resize_rings()` without reconnecting.
Libraries: liburing (C), tokio-uring 0.5/monoio 0.2 (Rust), Netty IOUring (Java), iouring-go (Go).
- Read buffer size: ≥ 1 full application frame (HTTP/2: 16 KB; binary: max PDU size).
- Write buffering: `TCP_CORK`/`MSG_MORE` to hold until batch full. Then flush in one `sendmsg`.
- Page-cache bypass: `O_DIRECT` for database-type I/O (aligned 4096, size multiple of block size).
- mmap for large read-only datasets: OS page cache manages eviction; no `read()` syscalls.
- `MADV_WILLNEED`: prefault pages before hot path. `MADV_DONTNEED` to release without freeing VA.
- GSO: batch N datagrams into one super-packet; `UDP_SEGMENT` cmsg with per-segment size. One `sendmsg()` → N wire datagrams. Requires `tx-udp-segmentation` + `tx-checksum-ip-generic` in `ethtool`.
- GRO: `setsockopt(fd, SOL_UDP, UDP_GRO, &one, sizeof(int))` with `int one = 1`; `recvmsg` returns super-packet + `UDP_GRO` cmsg with stride.
- Fallback: `sendmmsg`/`recvmmsg` for environments without hardware GSO; still batches syscalls.
- Result: Tailscale 10 Gb/s sustained (was 1–3 Gb/s pre-GSO/GRO). QUIC/HTTP3 uses same path.
- SPSC: ring buffer, power-of-two size, head/tail on separate 64-byte cache lines, acquire/release only. 10–30 ns/op.
- Disruptor: pre-allocated ring, sequence numbers, CAS claim, busy-spin wait. Sub-100 ns; tens of millions msg/s.
- Sharding: N independent structures, `shard = hash(key) & (N-1)`. Reduce contention by N. N = CPU cores × 4, rounded up to a power of two so the mask is valid.
- Atomic ordering (cheapest → costliest): relaxed → acquire/release → seq_cst. x86: acquire/release free (TSO); ARM: barriers ~10–20 ns.
- False sharing: pad independently-written atomics/counters to 64 bytes.
- Thread pool: size = CPU count for CPU-bound; higher for I/O-bound (benchmark). Avoid `ThreadPerTask`.
- Work stealing: `ForkJoinPool` (Java), Tokio scheduler (Rust), Go runtime (M:N). Minimizes thread idle time.
- rseq (restartable sequences, glibc 2.35+): per-CPU counters/freelists at plain load/store speed, with no atomics. Kernel restarts the section on preemption/migration. Abseil TCMalloc per-CPU mode is built on it.
- NUMA: allocate memory on same NUMA node as threads that use it. `numactl --membind=N --cpunodebind=N`.
- `GOMAXPROCS` reads cgroup CPU quota in Go 1.25+.
- Lock-free invariant: `load(acquire)` pairs with `store(release)`. CAS loops: exponential backoff or yield before retry.
TCP_NODELAY=1 // disable Nagle; critical for request-response
SO_KEEPALIVE=1 // detect dead peers
SO_REUSEADDR=1 // fast restart
SO_REUSEPORT=1 // parallel accept across threads
SO_RCVBUF=4*1024*1024
SO_SNDBUF=4*1024*1024
SO_BUSY_POLL=50 // microseconds; avoids sleep/wakeup; uses CPU

Linux TCP: tcp_slow_start_after_idle=0, tcp_congestion_control=bbr, net.core.netdev_max_backlog=65536, net.ipv4.ip_local_port_range="1024 65535".
- 1-RTT PSK resumption: server sends `NewSessionTicket` after handshake; client re-sends PSK in next `ClientHello`. Go: `tls.Config.SetSessionTicketKeys()` for distributed KEK sharing. Nginx: `ssl_session_cache shared:SSL:50m; ssl_session_timeout 1d`.
- 0-RTT early data: client sends data with `ClientHello` (zero RTT). Use ONLY for idempotent ops (GET, reads). Anti-replay: Redis `SET NX` per ticket + ±5s time window.
- Distributed KEK: rotate every 24h, accept old for 48h. Store in Vault/Secrets Manager. Ticket lifetime: 24h default, 1h for high-security APIs.
- OCSP stapling: eliminates 100–500 ms CA round-trip on new sessions. Nginx: `ssl_stapling on; ssl_stapling_verify on;`.
- 40% of Cloudflare TLS sessions are resumptions.
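The 0-RTT anti-replay rule (one use per ticket plus a ±5 s time window) might be sketched as below. An in-memory dict stands in for Redis `SET key value NX EX ttl`; the class, its parameters, and the 10 s TTL are hypothetical. The TTL is assumed to be at least twice the window, so that a ticket cannot be replayed after its entry expires while its original timestamp is still inside the window.

```python
import time


class AntiReplayCache:
    """In-memory stand-in for Redis SET NX with an expiry (illustrative only)."""

    def __init__(self, window_seconds=5.0, ttl_seconds=10.0):
        self.window = window_seconds
        self.ttl = ttl_seconds
        self._seen = {}  # ticket_id -> expiry time

    def accept_early_data(self, ticket_id, client_time, now=None):
        now = time.time() if now is None else now
        if abs(now - client_time) > self.window:
            return False  # outside the time window: reject, fall back to 1-RTT
        # Drop expired entries so the table stays bounded.
        self._seen = {t: exp for t, exp in self._seen.items() if exp > now}
        if ticket_id in self._seen:
            return False  # ticket already used: possible replay
        self._seen[ticket_id] = now + self.ttl  # "set if not exists"
        return True
```

With a real Redis deployment the check-and-set would be a single atomic `SET ... NX` call, so that concurrent servers cannot both accept the same ticket.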
zstd --train samples/* -o dict.zstd --maxdict=114688 # 112 KB dict

Hot path: `ZSTD_createCDict_byReference()` once; `ZSTD_compress_usingCDict(cctx, ...)` per call. No alloc. Similarly `ZSTD_DDict` + `ZSTD_decompress_usingDDict`. Zstd 1.5.7 (Feb 2025): 4.9× faster dict ops. Gains: 30–40× on API responses; 60–90× on similar-structure payloads (Roblox feature flags).
- Async thread-local logging: thread-local ring buffer → MPSC queue → single background writer. No `sprintf` in hot path; pre-format to fixed struct; format in background.
- Level check: single atomic load before any formatting. DEBUG = no-op in production.
- Rate limiting: emit first occurrence; suppress + summary every 1 s.
- Metrics: per-CPU-core counters (no synchronization); aggregate at scrape time. HDR Histogram for latency; Prometheus format.
- Tracing: W3C TraceContext (`traceparent`); 1% head sampling or tail-based (bias slow/error). OpenTelemetry → Jaeger/Tempo. Span overhead: skip per-iteration spans.
- Continuous profiling: eBPF (Parca, Pyroscope); zero overhead when idle.
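The log rate-limiting bullet above (emit the first occurrence, then suppress and summarize) could look roughly like this. The class and method names are hypothetical. In this simplified version the summary count is reported with the first message after the interval elapses, not by a background timer.

```python
class LogRateLimiter:
    def __init__(self, interval_seconds=1.0):
        self.interval = interval_seconds
        self._state = {}  # message key -> [window_start, suppressed_count]

    def check(self, key, now):
        """Return (emit, suppressed_count_to_report) for one log call."""
        entry = self._state.get(key)
        if entry is None:
            self._state[key] = [now, 0]
            return True, 0  # first occurrence: emit
        if now - entry[0] < self.interval:
            entry[1] += 1
            return False, 0  # inside the window: suppress and count
        suppressed = entry[1]
        entry[0], entry[1] = now, 0  # start a new window
        return True, suppressed  # emit again, with the summary count
```

A caller would log the message when `emit` is true and append something like "(N similar suppressed)" when the second value is non-zero.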
- Monolith over microservices: in-process call < 10 ns vs network call 500 µs–5 ms. Shared memory 1000× faster than REST for same-host data.
- No runtime reflection/DI in hot path: Spring/Guice startup cost, proxy overhead, megamorphic call sites. Use compile-time codegen (Quarkus, Micronaut, Dagger) or manual wiring.
- Avoid dynamic dispatch in hot loops: virtual calls prevent inlining; cause megamorphic sites (> 3 types = 5–20× slower). Use concrete types + generics/templates (Rust `impl Trait`, C++ templates).
- Codegen at build time: Protobuf/FlatBuffers, JOOQ/SQLc, ANTLR. Generated code is inlineable, statically typed, optimizable.
- No DI containers in request path: use the context object pattern and pass a `RequestContext` struct through the call chain.
- Batch work across FFI boundary: one call with large array vs N calls × 1 element.
- CGO cost: ~60–200 ns per call (goroutine stack switch). Batch.
- Pass pre-allocated output buffers — avoid native code allocating and returning heap ptrs.
- Off-heap exchange medium: no GC pinning needed.
Pattern: Rust cdylib shim (polars crate) → extern "C" opaque handles (u64) → any language via its FFI. Arrow C Data Interface (ArrowSchema/ArrowArray structs) for zero-copy data transfer.
Key: polars_df_execute_plan(json_plan) — ship entire lazy query plan as JSON string; single FFI call. Polars does predicate pushdown + parallel execution.
Callers: Bun.js via bun:ffi (dlopen, FFIType.u64, ptr(TypedArray) = zero copy); Go via CGO.
Access methods:
- Official language binding (`pip install arrayfire`, `npm install arrayfire-js`, `cargo add arrayfire`).
- Direct C ABI (`libaf` handles) via any FFI.
- Custom Rust `cdylib` shim for domain-fused ops (preserves JIT kernel fusion across FFI boundary).
Critical: chain ops inside shim to preserve JIT fusion. af_shim_matmul_sigmoid() = one GPU kernel; calling matmul then sigmoid separately = two kernels + VRAM roundtrip.
Backend order: CUDA → OpenCL → CPU (BLAS+AVX-512). set_backend(Backend::DEFAULT).
| Property | Key | Code implication |
|---|---|---|
| OoOE (ROB 200–600) | Decouples issue from execute | Expose independent ops; use multiple accumulators |
| Superscalar (4–8 ports/cycle) | Mix port types in inner loops | Interleave load/store/ALU to saturate all ports |
| Branch predictor (TAGE) | > 99% accuracy regular patterns | cmov/branchless for per-element data; arrange loop exits as "not taken" |
| Store-forwarding | Same-address load after store: ~5 cycles | Fails (10–15 cycle stall) on size/alignment mismatch |
| MLP (10–20 outstanding MSHR) | Sequential access saturates; pointer-chase serializes | Prefer flat arrays; issue __builtin_prefetch 200–400 ns ahead |
| I-cache / µop cache (1500–4000 µops) | Hot loops must fit | #[cold] / __attribute__((cold)) for error paths |
2026 silicon: Zen 5 = full 512-bit AVX-512 datapath (4 native units; 40–50% uplift vs Zen 4). Intel Panther Lake = Intel 18A (GAA transistors + BSPDN), mobile-first, Q1 2026. AMD Zen 6 (EPYC Venice H2 2026) = TSMC 2nm N2P, AVX10.2, FP8.
| ISA | Width | CPUs | Key Ops |
|---|---|---|---|
| AVX2 | 256-bit | All modern x86 | 8×f32/int; baseline SIMD target |
| AVX-512 | 512-bit | Zen 4/5, Xeon 6 | 16×f32; mask registers; compress/expand |
| AVX10.1 | 512-bit | Xeon 6 (Granite Rapids) | Unified ISA: all AVX-512 sub-ext in one CPUID bit |
| AVX10.2 | 512-bit | Xeon Diamond Rapids 2026, Zen 6 | FP16/BF16 scalar+vector; OCP FP8 (E4M3/E5M2); IEEE NaN semantics |
| ARM SVE2 | 128–2048-bit (runtime) | Cortex-X4, Graviton 4 | Variable-width; one binary for all widths |
| ARM SME | — | Cortex-X925, Apple M4 | Matrix outer product; standardized AMX equivalent |
| Intel AMX | 8 tiles (16×64 B each) | Xeon 4/5/6 only | 1024 MAC/instruction; ~512 INT8 TOPS/socket |
Google Highway 1.2 (github.com/google/highway): one source → AVX2/AVX-512/AVX10/NEON/SVE2/WASM. HWY_DYNAMIC_DISPATCH. vqsort = 3–8× vs std::sort. Preferred over raw intrinsics.
AVX10.2 detection: CPUID leaf 24H, EAX=24H, ECX=0H → EBX[7:0] >= 2.
Decoding is memory-bandwidth-bound: tokens/sec ≈ mem_BW / bytes_per_token. 70B FP16 (140 GB) on 3.35 TB/s ≈ 24 tok/s single-stream ceiling; 4-bit (35 GB) ≈ 96 tok/s. This formula explains most inference behavior.
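The bandwidth formula above reduces to a one-line calculation; the sketch below reproduces the two figures quoted in the text. The function name is hypothetical, and the model deliberately ignores KV-cache reads and batching, so it is a ceiling estimate only.

```python
def decode_tokens_per_sec(mem_bandwidth_gb_s, params_billion, bits_per_weight):
    """Single-stream decode ceiling: each token streams all weights once."""
    weight_gb = params_billion * bits_per_weight / 8  # bytes read per token, in GB
    return mem_bandwidth_gb_s / weight_gb


if __name__ == "__main__":
    # 70B parameters on 3.35 TB/s of memory bandwidth.
    print(round(decode_tokens_per_sec(3350, 70, 16)))  # FP16 weights
    print(round(decode_tokens_per_sec(3350, 70, 4)))   # 4-bit weights
```

Because the weight read is shared across a batch, batching multiplies aggregate throughput, which is why the techniques in the following table matter.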
| Technique | Gain | Tool |
|---|---|---|
| Continuous batching (evict/admit per decode step) | 5–20× vs static batching | vLLM, TGI, TensorRT-LLM |
| PagedAttention (KV cache in OS-style pages) | Recovers 60–80% fragmented KV memory | vLLM |
| Prefix caching / RadixAttention (shared system prompts) | Large TTFT cut at high QPS | SGLang |
| Speculative decoding (draft k tokens, verify in 1 pass) | 2–3× decode, identical output dist | EAGLE, Medusa |
| KV cache FP8/INT8 | 2× max batch size | Hopper+/AVX10.2 FP8 |
| Weight 4-bit (AWQ/GPTQ/FP4) | ~4× decode throughput, ≤1% quality loss | AWQ |
| Chunked prefill | Long prompt no longer stalls decode latency | vLLM/SGLang |
| CUDA Graphs decode capture | Removes ~5–10 µs × N kernel-launch overhead/token | TensorRT-LLM |
TP (tensor parallel) within NVLink node; PP (pipeline) across nodes. Stacks: vLLM (default), SGLang (shared-prefix QPS), TensorRT-LLM (peak NVIDIA), llama.cpp (CPU/edge).
- Parser state = integer cursor variables in CPU registers only. No heap alloc in hot path.
- Store `(start: uint16, len: uint16)` offsets into raw receive buffer, not copies.
- `ErrIncomplete`: return without consuming bytes; caller accumulates in ring buffer + retries.
- Bulk delimiter scan: `bytes.IndexByte` (Go) / `memchr` (C/Rust); SIMD, skips 16–32 bytes/cycle vs 1.
- No `fmt.Sprintf`, `regexp.Match`, `json.Unmarshal` inside state machine. Identify boundaries first; validate semantics in app layer.
- Pre-allocate output struct from pool; fill in-place; single-pass left-to-right; no backtracking.
- Tiered codec pipeline: `receive buffer → AES-NI decrypt → LZ4/Zstd decompress → state machine parser → frame struct`. All buffers pre-allocated in graduated pools (4KB/16KB/64KB/256KB).
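The offset-based parsing rules above can be sketched for a simple newline-delimited protocol. The framing, function names, and the `INCOMPLETE` sentinel (standing in for `ErrIncomplete`) are assumptions for illustration; `bytes.find` plays the role of the bulk delimiter scan.

```python
INCOMPLETE = None  # hypothetical sentinel standing in for ErrIncomplete


def next_frame(buf, pos, delimiter=b"\n"):
    """Return ((start, length), new_pos) for the next frame, or INCOMPLETE."""
    end = buf.find(delimiter, pos)  # bulk scan, runs in C
    if end == -1:
        return INCOMPLETE  # consume nothing; caller retries with more bytes
    return (pos, end - pos), end + 1


def parse_all(buf):
    """Collect (start, length) spans of complete frames; no payload is copied."""
    spans, pos = [], 0
    while (result := next_frame(buf, pos)) is not INCOMPLETE:
        span, pos = result
        spans.append(span)
    return spans, pos  # pos = number of bytes consumed


if __name__ == "__main__":
    data = b"GET a\nGET bb\nGE"  # last frame is still incomplete
    spans, consumed = parse_all(data)
    view = memoryview(data)
    for start, length in spans:
        print(bytes(view[start:start + length]))  # copy only when needed
```

The trailing `b"GE"` is left unconsumed, so the caller can append more bytes and resume from the returned position.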
PG18 AIO (headline feature):
io_method = worker # default; all platforms
# io_method = io_uring # Linux 5.1+; requires --with-liburing; fastest
io_workers = 8 # ~25% of CPU cores
effective_io_concurrency = 200 # NVMe; 20–50 SATA SSD; 1–4 HDD
maintenance_io_concurrency = 64
io_combine_limit = 128 # pages per AIO request; larger = higher throughput

Verify liburing: `SELECT setting FROM pg_config() WHERE name='CONFIGURE'` → grep liburing.
Other PG18: skip scan B-tree (multi-column index without leading eq predicate), parallel GIN builds, statistics preserved across pg_upgrade, uuidv7() (monotonic UUIDs, fewer B-tree splits), virtual generated columns, wire protocol v3.2.
PG19 Beta: SQL/PGQ native graph queries, ON CONFLICT DO SELECT (atomic get-or-create), parallel autovacuum, pg_plan_advice hints framework, online REPACK, 64-bit MultiXact, JIT off by default, WAIT FOR LSN on replicas.
Production postgresql.conf:
shared_buffers = 8GB # 25% RAM
work_mem = 64MB # per sort/hash op (careful: × max_connections × 4)
maintenance_work_mem = 1GB # VACUUM, CREATE INDEX
wal_compression = zstd # 40–60% WAL reduction
max_wal_size = 8GB
checkpoint_completion_target = 0.9
synchronous_commit = off # async commit; risk: up to 3 × wal_writer_delay (~600 ms at the 200 ms default) of commits lost on crash
huge_pages = try

| Type | Use When | Downside |
|---|---|---|
| B-Tree | Most scalar columns; `<`, `=`, `LIKE 'x%'`, `BETWEEN` | Write amplification on splits |
| Partial | `WHERE active=true`; low-cardinality flag; 5–50× smaller | Query WHERE must syntactically imply index predicate (form must match) |
| Expression | `lower(email)`, `date_trunc(...)` | Applies only to matching expression |
| GIN | Full-text, JSONB `@>`, arrays `&&`; `tsvector` | Expensive updates; enable `fastupdate=on`; VACUUM pressure |
| GIN parallel | PG18: `CREATE INDEX USING gin(...)` uses parallel workers | None |
Partition pruning works only when: partition key in WHERE, constant or bound param, enable_partition_pruning=on. Function call on partition key (date_trunc(...)) prevents pruning. Use pg_partman for auto-creation.
Partition strategies: RANGE (time-series), LIST (region/enum), HASH (uniform distribution). Each partition gets its own index — smaller and faster to maintain.
CREATE EXTENSION IF NOT EXISTS timescaledb;
SELECT create_hypertable('metrics', 'time', chunk_time_interval => INTERVAL '1 day');
-- Compression: 90–98% space saving on cold chunks
ALTER TABLE metrics SET (timescaledb.compress, timescaledb.compress_segmentby='device_id',
timescaledb.compress_orderby='time DESC');
SELECT add_compression_policy('metrics', INTERVAL '7 days');
-- Continuous aggregate with auto-refresh
CREATE MATERIALIZED VIEW metrics_1h WITH (timescaledb.continuous) AS
SELECT time_bucket('1h', time) AS bucket, device_id, avg(cpu) FROM metrics GROUP BY 1,2;
SELECT add_continuous_aggregate_policy('metrics_1h',
  start_offset => INTERVAL '3 hours', end_offset => INTERVAL '1 hour',
  schedule_interval => INTERVAL '1 hour');
SELECT add_retention_policy('metrics', INTERVAL '90 days');

CREATE EXTENSION vector;
-- HNSW (preferred): graph-based ANN, ~95% recall
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops)
WITH (m=16, ef_construction=64);
SET hnsw.ef_search = 200; -- nodes explored per query; higher = better recall
-- IVFFlat alternative: smaller index, ~90% recall
CREATE INDEX ON documents USING ivfflat (embedding vector_l2_ops) WITH (lists=100);
SET ivfflat.probes = 10;
-- Nearest-neighbor query
SELECT id FROM documents ORDER BY embedding <=> $1 LIMIT 10;
-- <=> cosine distance; <-> L2; <#> negative inner product

After failover: `SELECT pg_prewarm('idx_embedding')`. Per-partition HNSW for multi-tenant. `work_mem=256MB`, `max_parallel_workers_per_gather=4` for heavy ANN loads.
pg_textsearch (Timescale, v1.0 March 2026, PostgreSQL-native pages, Block-Max WAND): CREATE INDEX USING bm25ts (content). Query: ORDER BY content <@> 'query' DESC. 2.4–6.5× faster than ParadeDB for 2–4 term queries at 138M docs; 8.7× higher concurrent throughput.
pg_search (ParadeDB v0.22.5, Rust/Tantivy, AGPL): CREATE INDEX USING bm25(id, desc, category) WITH (key_field='id'). Query: WHERE tbl @@@ paradedb.parse('desc:fast'). Supports faceted search, fuzzy (~2), highlight, hybrid BM25+pgvector.
Choose pg_textsearch for pure throughput/concurrency; pg_search for full Elasticsearch-style features.
CREATE EXTENSION IF NOT EXISTS age;
LOAD 'age'; SET search_path = ag_catalog, "$user", public;
SELECT create_graph('g');
SELECT * FROM cypher('g', $$ CREATE (:Person {name:'Alice'})-[:KNOWS]->(:Person {name:'Bob'}) $$) AS (v agtype);
SELECT * FROM cypher('g', $$ MATCH (a:Person)-[:KNOWS*1..3]->(b) RETURN b.name $$) AS (name agtype);

Hybrid SQL+Cypher: join cypher result with relational table. GIN index on properties column for agtype filter performance.
# 1) PRIMARY postgresql.conf
wal_level=replica; max_wal_senders=10; wal_keep_size=1024
wal_compression=zstd; synchronous_commit=on; hot_standby=on
max_slot_wal_keep_size=4096 # cap WAL retention if replica goes offline (MB)
io_method=worker; io_workers=8 # PG18 AIO on WAL-sender path
# 2) PRIMARY: replication user + pg_hba.conf
psql -c "CREATE ROLE replicator WITH REPLICATION LOGIN ENCRYPTED PASSWORD 'pass';"
echo "host replication replicator 192.168.1.11/32 scram-sha-256" >> pg_hba.conf
psql -c "SELECT pg_create_physical_replication_slot('replica1_slot');"
systemctl reload postgresql@18-main
# 3) REPLICA: clone (writes standby.signal + primary_conninfo automatically)
systemctl stop postgresql@18-main && rm -rf /var/lib/postgresql/18/main
pg_basebackup --host=192.168.1.10 --username=replicator \
--pgdata=/var/lib/postgresql/18/main \
--wal-method=stream --slot=replica1_slot --write-recovery-conf --checkpoint=fast --progress
# 4) REPLICA postgresql.conf additions
hot_standby=on; hot_standby_feedback=on # feedback stops primary VACUUM removing rows replica needs
max_standby_streaming_delay=30s
io_method=worker; io_workers=4
systemctl start postgresql@18-main

-- 5) Verify on PRIMARY:
SELECT client_addr, state, sync_state,
pg_size_pretty(pg_wal_lsn_diff(sent_lsn, replay_lsn)) AS lag
FROM pg_stat_replication; -- expect state=streaming, lag=0 bytes
-- on REPLICA:
SELECT pg_is_in_recovery(); -- true
SELECT now() - pg_last_xact_replay_timestamp() AS delay;
-- Slot lag (growing = lagging replica blocks WAL cleanup):
SELECT slot_name, active,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained
FROM pg_replication_slots;

Sync (zero data loss): `synchronous_commit=remote_apply`; `synchronous_standby_names='FIRST 1 (replica1)'`. Adds 1 RTT per commit; use `remote_write` for lower latency.
Logical (cross-version, per-table): CREATE PUBLICATION p FOR TABLE t1,t2; → CREATE SUBSCRIPTION s CONNECTION '...' PUBLICATION p WITH (streaming=on);
Failover: SELECT pg_promote(); on the replica → repoint other replicas' primary_conninfo. Use Patroni for automation; PgBouncer in front of each tier.
InnoDB-Based Binary Log (headline feature):
log-bin
binlog-storage-engine = innodb # stores binlog in InnoDB tablespace
# eliminates 2PC; halves fsyncs; 4× write throughput
# 2.4× faster single-thread; 50% commit latency reduction
# crash-safe without sync overhead
# Safe with InnoDB binlog (was dangerous with file-based binlog):
innodb_flush_log_at_trx_commit = 2
sync_binlog = 0

Full my.cnf key params:
server_id = 1
binlog-storage-engine = innodb
gtid_domain_id = 1; gtid_strict_mode = ON
binlog_format = ROW; binlog_row_image = MINIMAL # 50–80% smaller binlog
innodb_buffer_pool_size = 12G # 70–80% RAM
# innodb_buffer_pool_instances was removed in MariaDB 10.6; do not set it on 12.x
innodb_flush_method = O_DIRECT
innodb_io_capacity = 2000 # NVMe: 8000–20000
innodb_redo_log_capacity = 4G
thread_handling = pool-of-threads; thread_pool_size = 32
query_cache_type = 0; query_cache_size = 0 # disabled 10.6+

Other 12.3 features: native vector search in storage, JOIN_FIXED_ORDER, MAX_EXECUTION_TIME optimizer hints, caching_sha2_password (MySQL 8 compat), XML type, aria_pagecache_segments (1–128 for parallel Aria).
# 1) PRIMARY /etc/mysql/mariadb.conf.d/50-server.cnf
[mariadb]
server_id=1
log-bin
log_basename=mariadb-primary
binlog-storage-engine=innodb # 12.3: crash-safe binlog, no 2PC, 4× write throughput
gtid_domain_id=1; gtid_strict_mode=ON
binlog_format=ROW; binlog_row_image=MINIMAL # 50–80% smaller binlog
sync_binlog=0 # safe WITH InnoDB binlog (redo log covers it)
innodb_flush_log_at_trx_commit=2 # safe WITH InnoDB binlog
expire_logs_days=7
# 2) REPLICA config
[mariadb]
server_id=2 # must be unique
log-bin; log_basename=mariadb-replica1
binlog-storage-engine=innodb
gtid_domain_id=1; gtid_strict_mode=ON
log_slave_updates=ON # needed for chained replicas
read_only=ON; super_read_only=ON # block even SUPER users from writing
relay_log=/var/log/mysql/relay-bin; relay_log_purge=ON

-- 3) PRIMARY: replication user
CREATE USER 'replicator'@'192.168.1.%' IDENTIFIED BY 'pass';
GRANT REPLICATION SLAVE ON *.* TO 'replicator'@'192.168.1.%';

# 4) Backup primary (hot, no locks) → restore on replica
mariabackup --backup --target-dir=/backup/base --user=root --password=...
mariabackup --prepare --target-dir=/backup/base
grep gtid_binlog_pos /backup/base/xtrabackup_info # e.g. 0-1-14728
# on replica: stop mariadb; rm -rf /var/lib/mysql/*; mariabackup --copy-back ...; chown -R mysql:mysql /var/lib/mysql

-- 5) REPLICA: point at primary using GTID and start
SET GLOBAL gtid_slave_pos = '0-1-14728';
CHANGE MASTER TO MASTER_HOST='192.168.1.10', MASTER_USER='replicator',
MASTER_PASSWORD='pass', MASTER_USE_GTID=slave_pos;
START SLAVE;
SHOW SLAVE STATUS\G -- Slave_IO_Running=Yes, Slave_SQL_Running=Yes, Seconds_Behind_Master=0
-- 6) Failover: on promoted replica
STOP SLAVE; RESET SLAVE ALL;
SET GLOBAL read_only=OFF; SET GLOBAL super_read_only=OFF;
-- repoint others: CHANGE MASTER TO MASTER_HOST='new', MASTER_USE_GTID=current_pos; START SLAVE;

MaxScale for automation: `module=mariadbmon` monitor with `auto_failover=true`, `auto_rejoin=true`; `router=readwritesplit` service sends writes→primary, reads→replicas; apps connect to MaxScale listener (e.g. :4006) instead of 3306.
INSTALL SONAME 'ha_tidesdb';
CREATE TABLE events (...) ENGINE=TidesDB;

TidesDB vs RocksDB (NVMe, HammerDB TPC-C, Feb 2026): p50 GET 3 µs vs 4 µs; p99 GET 7 µs vs 12 µs; iteration 1.42× faster; storage 5.6× smaller. Write-heavy TPC-C: TidesDB wins; read-dominant: InnoDB wins. Stable with jemalloc; RocksDB crashes. v8.6: `max_memory_usage` field in `tidesdb_config_t` caps total engine footprint.
KV databases: TidesDB v9.0.8 (C, ACID+SSI, 5 isolation levels, column families, Kafka Streams drop-in); BadgerDB 4.7.0 (pure Go, WiscKey LSM, SSI, Dgraph/Jaeger/Pyroscope production).
| Format | Use When |
|---|---|
| Parquet | Analytical queries; 5–20× smaller than JSON; row group 128 MB, ZSTD L3 |
| Iceberg | Parquet + ACID + time travel + partition evolution + O(partitions) pruning |
| DuckLake | Iceberg semantics; catalog = DuckDB .db file; zero infra dependency |
| Delta Lake | Databricks-native; Delta UniForm for cross-engine reads |
| Lance | ML training data; O(1) random row access; native vector columns; zero-copy mmap→PyTorch |
| Arrow IPC/Flight | In-process zero-copy; Flight SQL replaces JDBC (10–50× throughput) |
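# Rust maximum performance (cargo's release profile already sets opt-level=3)
# LTO, codegen-units and panic are set through the profile: passing -C lto via RUSTFLAGS can conflict with cargo's own codegen flags
RUSTFLAGS="-C target-cpu=native" CARGO_PROFILE_RELEASE_LTO=fat CARGO_PROFILE_RELEASE_CODEGEN_UNITS=1 CARGO_PROFILE_RELEASE_PANIC=abort cargo build --release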
# LLD is default linker on x86_64 Linux since Rust 1.90 (2–4× faster than GNU ld)
# PGO (Profile-Guided Optimization): +10–20%
# 1. Build with instrumentation 2. Run production-like workload 3. Rebuild with profile
cargo pgo build; cargo pgo run; cargo pgo optimize # cargo-pgo subcommands automate the workflow
# BOLT post-link (on top of PGO): additional 5–15%
cargo pgo bolt build --with-pgo # then run the BOLT-instrumented binary on a production-like workload
cargo pgo bolt optimize --with-pgo
# GCC/Clang PGO
-fprofile-generate → run → -fprofile-use
-fprofile-partial-training # for unexercised paths
# LTO (Link-Time Optimization)
-flto # GCC and Clang; Clang also accepts -flto=full or -flto=thin; enables cross-TU inlining
# Java AOT cache (Leyden, JDK 24–26+, ZGC-compatible in Java 26)
java -XX:AOTCacheOutput=app.aot -jar app.jar # training run (one-step form, JDK 25+)
java -XX:AOTCache=app.aot -jar app.jar # production
Linux 7.0 (Apr 12, 2026; current: 7.0.10 May 23, 2026; 7.1 expected Jul 2026):
- Rust stable in kernel (first-class drivers)
- io_uring: non-circular queues (better cache), cBPF filters, IOPOLL completion fix
- Rebuilt hybrid CPU scheduler: P-cores for latency-sensitive, E-cores for background — automatic, no cgroup config
- Open Tree Namespace: faster container creation
- XFS self-healing: runtime metadata repair without unmount
- Lazy preemption default; Intel TSX enabled on newer chips
- 7.1: in-kernel NTFS R/W, AMD XDNA v3, ARM SVE2/SME stabilization
Kernel parameters (sysctl):
vm.swappiness=1
vm.dirty_ratio=15; vm.dirty_background_ratio=5
net.core.netdev_max_backlog=65536
net.ipv4.ip_local_port_range="1024 65535"
net.ipv4.tcp_syncookies=1; net.ipv4.tcp_slow_start_after_idle=0
echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
echo defer+madvise > /sys/kernel/mm/transparent_hugepage/defrag
echo 1048576 > /proc/sys/fs/file-max
ulimit -n 1048576
CPU isolation (latency-critical threads):
# GRUB: isolcpus=4-15 nohz_full=4-15 rcu_nocbs=4-15
taskset -c 4-15 ./app # pin to isolated cores
I/O scheduler: none (NVMe), mq-deadline (SATA SSD), bfq (mixed/HDD).
sched_ext (Linux 6.12+): BPF-programmable scheduler. Write custom scheduling policies in BPF; loaded at runtime without kernel recompile.
PREEMPT_LAZY (default in Linux 7.0): reduces unnecessary preemption while maintaining responsiveness. RT threads: chrt -f 99 ./app.
Huge pages: echo N > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages. THP: madvise mode (don't use always globally — causes latency spikes on anonymous alloc).
IRQ affinity: pin NIC interrupts to dedicated cores. Find the IRQ number N with cat /proc/interrupts | grep eth0, then echo 1 > /proc/irq/N/smp_affinity (the value is a hex CPU bitmask; 1 = CPU 0).
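The value written to smp_affinity is a hexadecimal CPU bitmask, which is easy to get wrong by hand. The following is a minimal Python sketch of the conversion from a CPU list, assuming the kernel's comma-separated 32-bit group format for masks wider than 32 CPUs. The function name smp_affinity_mask is hypothetical, not part of any tool.

```python
def smp_affinity_mask(cpus):
    """Build the hex bitmask string for /proc/irq/N/smp_affinity.

    Assumption: masks are written as comma-separated groups of 8 hex digits
    (32 bits each), most significant group first.
    """
    mask = 0
    for cpu in cpus:
        if cpu < 0:
            raise ValueError("CPU index must be non-negative")
        mask |= 1 << cpu  # bit i set means CPU i may service the IRQ

    digits = format(mask, "x")
    width = -(-len(digits) // 8) * 8  # round up to a multiple of 8 hex digits
    digits = digits.zfill(width)
    return ",".join(digits[i:i + 8] for i in range(0, width, 8))


if __name__ == "__main__":
    # Print the mask for CPUs 2 and 3; writing it to procfs requires root.
    print(smp_affinity_mask([2, 3]))
```

For example, smp_affinity_mask([2, 3]) yields 0000000c. Where the kernel exposes it, /proc/irq/N/smp_affinity_list accepts a plain CPU list such as 2-3 and avoids the conversion entirely.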
Mitigations cost: mitigations=off recovers 2–5% (compute) to 15–30%+ (syscall-heavy: DBs, proxies) on trusted-code-only hosts. Never on multi-tenant/untrusted-guest machines. Audit: lscpu | grep -A20 Vulnerab.
resctrl (Intel RDT / AMD QoS): partition shared L3 + throttle memory bandwidth per cgroup — kills noisy-neighbor p99 spikes. mount -t resctrl resctrl /sys/fs/resctrl; write way-masks to schemata (L3:0=ff0), throttle batch to MB:0=30; monitor llc_occupancy/mbm_total_bytes.
MGLRU (6.1+): generational page reclaim — better working-set detection, fewer refaults under pressure. echo y > /sys/kernel/mm/lru_gen/enabled. DAMON: kernel access-frequency monitoring + actions (proactive reclaim, hot-page THP collapse, CXL tier promotion) via damo.
6.x LTS reference: 6.12 (LTS, Dec 2026 EOL): sched_ext stable, PREEMPT_RT mainlined. 6.18 (LTS, Dec 2027 EOL): recommended server LTS. 6.19: last 6.x release.
- mimalloc v3.3.2 (Jan 2026) · jemalloc 5.3.1 · Abseil TCMalloc (google/tcmalloc) · gperftools 2.17
- rkyv 0.8 (zero-copy Rust) · Cap'n Proto 1.0 · FlatBuffers 24.12 · protobuf 29.x · Apache Arrow 20.0
- Google Highway 1.2 · xxHash3 0.8.3 · wyhash 4.2 · Abseil flat_hash_map (2025-01) · RoaringBitmap 1.3 · DashMap 6.1 (Rust)
- Tokio 1.51 LTS (MSRV 1.71; until Mar 2027) · Axum 0.8.x · quic-go 0.49 (UDP GSO/GRO) · Netty 4.2 · LMAX Disruptor 4.0 · Aeron 1.47
- Zstd 1.5.7 (Feb 2026; dict 4.9× faster) · LZ4 1.10.0 · zlib-ng 2.2.4 (2–3× faster) · Brotli 1.1.0
- TidesDB v9.0.8 · BadgerDB 4.7.0 · DuckDB 1.3 · Polars 1.24 · SQLite 3.46
- async-profiler 3.0 (Java) · Intel VTune 2025.0 · bpftrace 0.22 · Parca 0.19 · clinic.js 12.0 (Node.js) · perf (Linux 7.0)
- CPU: perf record -g -F 999 -- ./app → perf script | flamegraph.pl > flame.svg. Widest towers = hottest.
- Cache: perf stat -e cache-misses,L1-dcache-load-misses,LLC-load-misses ./app. LLC miss > 1% is significant.
- Memory: heaptrack (C++), async-profiler -e alloc (Java), pprof heap (Go), memray (Python).
- Locks: async-profiler -e lock (Java), pprof mutex (Go), perf lock (Linux).
- I/O: iostat -x 1; watch await, %util. bpftrace -e 'tracepoint:block:block_rq_complete { @sectors = hist(args->nr_sector); }' shows the request-size distribution in sectors (not latency).
- Network: ss -s, netstat -s | grep retransmit, ethtool -S eth0 | grep error.
Benchmark discipline: 30+ runs, p50/p95/p99/p999, never mean-only. Coordinated omission: closed-loop generators (wrk, ab) under-report p99 by 10–1000× during server stalls — use open-loop wrk2 -R, Gatling constant-rate, or k6 constant-arrival-rate + HdrHistogram corrected recording. Environment: governor=performance, turbo off, ASLR off, taskset+numactl pinned. Warm JVM ≥ 100K iterations. Flame graph before any optimization.
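To make the coordinated-omission point concrete, here is a small Python sketch of the correction that HdrHistogram applies when a value is recorded together with an expected interval: every sample longer than the intended request interval is back-filled with the samples a constant-rate client would have observed while it was stalled. The function names and the nearest-rank percentile helper are illustrative assumptions, not HdrHistogram's API.

```python
def correct_coordinated_omission(latencies, expected_interval):
    """Back-fill samples hidden by a closed-loop load generator.

    For each latency v above expected_interval, also record
    v - interval, v - 2*interval, ... while the value stays >= interval.
    """
    corrected = []
    for v in latencies:
        corrected.append(v)
        missing = v - expected_interval
        while missing >= expected_interval:
            corrected.append(missing)
            missing -= expected_interval
    return corrected


def percentile(samples, p):
    """Nearest-rank percentile; p is an integer in 1..100."""
    s = sorted(samples)
    rank = max(1, -(-len(s) * p // 100))  # ceiling division
    return s[rank - 1]


if __name__ == "__main__":
    # 99 fast requests (1 ms) and one 100 ms stall, intended rate: 1 request/ms.
    raw = [1] * 99 + [100]
    fixed = correct_coordinated_omission(raw, expected_interval=1)
    print("raw p99:", percentile(raw, 99))
    print("corrected p99:", percentile(fixed, 99))
```

With this toy input the raw p99 stays at 1 ms because the single stall counts as one sample, while the corrected p99 rises to 99 ms. Open-loop generators avoid the problem at the source, so the correction is only a fallback for closed-loop data.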
- TLS 1.3 only (ssl_protocols TLSv1.3); session tickets for resumption; OCSP stapling. AEAD ciphers (AES-256-GCM, ChaCha20-Poly1305) via AES-NI hardware: negligible CPU overhead.
- Memory-safe defaults: bounds-checked slices, no raw pointer arithmetic without explicit unsafe.
- scram-sha-256 for DB auth (MariaDB 12.3: caching_sha2_password).
- Input validation at boundaries: check length/type/range at ingestion; don't repeat per function.
- O_CLOEXEC on file descriptors; close extra fds before exec().
- Separate processes/namespaces for secret handling; don't log secrets or tokens.
Standard binary search: cache-hostile (on a 4 MB array, each of the first 8 iterations misses L1/L2 and pays an L3 access of ~40 ns, ~320 ns in total).
Eytzinger layout: sort in BFS order (dst[k] = src[i] where k descends 1→2→4→...). Top 4 levels fit in 3 cache lines. Search: `k = 2*k + (a[k] < x); __builtin_prefetch(a + k*16, 0, 0)`; ~40% faster than binary search for large N.
SIMD terminal scan (for final 32 elements): load 4×__m256i; _mm256_cmpgt_epi32 + _mm256_movemask_epi8; __builtin_popcount(mask)/4 = lower_bound index. 4 SIMD loads + 4 compares + popcount = ~2–4 cycles total.
Hybrid: Eytzinger descent (N → ~32) + SIMD linear scan = ~65% faster than std::lower_bound for N ∈ [512, 10K].
AVX-512 terminal scan: __mmask16 m = _mm512_cmplt_epi32_mask(v, vt); int pos = __builtin_popcount(m) — 16 elements per instruction, zero branches.
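The index arithmetic of the Eytzinger descent is easier to follow in a scalar sketch. The Python below only illustrates the layout and the lower-bound recovery step; it has no prefetching or SIMD, so it says nothing about the speedups quoted above. It assumes a 1-indexed array with slot 0 unused, and the function names are hypothetical.

```python
def eytzinger(sorted_vals):
    """Return a 1-indexed BFS-order (Eytzinger) copy; slot 0 is unused."""
    n = len(sorted_vals)
    out = [None] * (n + 1)
    it = iter(sorted_vals)

    def fill(k):
        # In-order traversal of the implicit tree (children 2k and 2k+1)
        # visits the slots in ascending key order.
        if k <= n:
            fill(2 * k)
            out[k] = next(it)
            fill(2 * k + 1)

    fill(1)
    return out


def lower_bound(a, x):
    """Slot index of the first value >= x in Eytzinger array a, or 0 if none."""
    n = len(a) - 1
    k = 1
    while k <= n:
        k = 2 * k + (a[k] < x)  # same descent step as in the text
    # Undo the trailing right turns plus the left turn before them:
    # shift by (number of trailing 1-bits in k) + 1, like ffs(~k) in C.
    k >>= (~k & (k + 1)).bit_length()
    return k


if __name__ == "__main__":
    a = eytzinger([1, 3, 5, 7, 9, 11, 13])
    print(a[1:])              # BFS order of the sorted input
    print(a[lower_bound(a, 6)])  # first value >= 6
```

A return value of 0 means every element is smaller than x, which is why slot 0 is left unused.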
| Algorithm | Library | When | Gain |
|---|---|---|---|
| pdqsort (pattern-defeating quicksort) | Rust slice::sort_unstable, C++ std::ranges::sort | Default for any comparison sort | Already in stdlib; branchless 64-element partition |
| vqsort (Google Highway) | hwy::VQSort(arr, n, hwy::SortAscending()) | N > 1K; int32/int64/f32/f64 | 3–8× vs std::sort; AVX-512 compress/expand for partitioning |
| SIMD radix sort (LSD, 8-bit digits) | ska_sort (C++), ips4o (parallel), rdxsort (Rust) | Integer/float keys N > 100K | O(N) theoretical; SIMD histogram build |
| Sort networks (fixed N) | Manual intrinsics | N = 4/8/16/32 (e.g., sorting SIMD registers, tuple keys) | 8 f32 in ~6 cycles (vs ~40 cycles scalar insertion sort) |
| Cache-oblivious merge | Manual; block = L1 size / (2 × element_size) elements | Large arrays exceeding L2 | Bottom-up; merge on L1-resident subarrays |
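As a reference for the radix-sort row, the sketch below shows a scalar LSD radix sort with 8-bit digits: one histogram, one exclusive prefix sum and one stable scatter per byte. It is a plain Python illustration of the passes under the assumption of unsigned 32-bit keys, not the SIMD histogram variant or any of the libraries named in the table. The function name is hypothetical.

```python
def radix_sort_u32(keys):
    """LSD radix sort for integers in [0, 2**32), 8 bits per pass."""
    src = list(keys)
    for k in src:
        if not 0 <= k < 2 ** 32:
            raise ValueError("keys must fit in an unsigned 32-bit integer")

    for shift in (0, 8, 16, 24):  # least significant byte first
        # 1) Histogram of the current digit.
        counts = [0] * 256
        for k in src:
            counts[(k >> shift) & 0xFF] += 1

        # 2) Exclusive prefix sum: counts[d] becomes the first output slot for digit d.
        total = 0
        for d in range(256):
            counts[d], total = total, total + counts[d]

        # 3) Stable scatter into the destination buffer.
        dst = [0] * len(src)
        for k in src:
            d = (k >> shift) & 0xFF
            dst[counts[d]] = k
            counts[d] += 1
        src = dst
    return src


if __name__ == "__main__":
    print(radix_sort_u32([70000, 3, 4_000_000_000, 255, 256, 0]))
```

Each pass is stable, which is what lets the four byte-wise passes compose into a full ordering. Float keys would need a bit transform first, which is omitted here.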
- NUMA allocation: numactl --membind=N --cpunodebind=N ./app. Linux CXL memory appears as additional NUMA node.
- mbind(addr, len, MPOL_BIND, &cxl_nodemask, ...) for CXL tier placement. MPOL_PREFERRED_MANY for DDR5-first, CXL fallback. MemTis/TPP (Transparent Page Placement): auto-migrate pages DDR5↔CXL by access frequency.
- CXL latency: ~120–200 ns (vs ~50–80 ns DDR5). Use for warm/cold data; keep hot data in local DDR5.
- CXL 3.0: shared memory pool across multiple hosts via CXL switch; relevant for distributed caches.
- Tools: lscpu | grep NUMA, numastat, numactl --hardware, ls /sys/bus/cxl/devices/.
- EXPLAIN (ANALYZE, BUFFERS) on every production query.
- Index strategy reviewed: B-tree vs partial vs GIN; partition pruning verified.
- Heap/GC configured: -Xms = -Xmx, GC type chosen per workload, logging enabled.
- Allocator chosen: mimalloc v3 / jemalloc; 1/4 RAM pre-warm in place.
- io_uring enabled where applicable: PG18 io_method=io_uring, server I/O paths.
- TLS: PSK resumption, OCSP stapling, distributed KEK rotation.
- Compression: Zstd dictionary trained on representative samples for payloads < 4 KB.
- Hot paths profiled: flame graph shows expected top functions, not surprises.
- False sharing audit: shared counters/atomics padded to 64 bytes.
- SIMD opportunities: inner loops auto-vectorizing? Check with -fopt-info-vec.
- PGO applied to production binary (10–20% gain for free).
- MALLOC_CONF / mimalloc env vars set for allocator.
- sysctl applied: vm.swappiness=1, tcp_slow_start_after_idle=0, THP=madvise.
- isolcpus set for latency-critical threads.
- Benchmarks: 30 runs, p99 ≤ SLA, regression CI in place.