AGENT.md — Extreme Performance Reference (Compact Edition)

June 2026 · Java 25/26 · Go 1.26 · Rust 1.94 · Python 3.14 · Node 24 · Linux 7.0/6.18 LTS · PG 18.4 · MariaDB 12.3.2 LTS

Latency Numbers (2026 Hardware)

Operation	Latency
L1 hit	~1 ns
L2 hit	~4 ns
L3 hit	~10–40 ns
DDR5 DRAM	~50–80 ns
HBM3e (on-package; latency is higher than DDR5, the advantage is bandwidth: ~1.2 TB/s per stack)	~100–150 ns
CXL 2.0/3.0 over PCIe	~120–200 ns
NVMe PCIe 5 seq read	~50–100 µs
NVMe 4K random read	~100–200 µs
Datacenter TCP RTT	~500 µs
Mutex uncontended	~20–40 ns
Mutex contended	~200–1000 ns
Context switch	~1–10 µs
syscall (ring 0)	~100–1000 ns
vDSO (clock_gettime)	~5–15 ns
CAS uncontended	~5–10 ns
CAS cross-core	~50–500 ns
AVX-512 op 16×f32 (Zen 5)	~0.5–1 ns
GPU kernel launch	~5–10 µs
Branch misprediction	~10–20 cycles
TLB miss 4 KB page	~100–1000 cycles
TLB miss 2 MB page	~20–100 cycles

Rule: L1 → DRAM gap is roughly 50–80× (~1 ns vs ~50–80 ns). Design data layout around this.

Data Types & Memory

Smallest correct primitive: use uint8/16/int32 not int64/float64 when domain permits — packs more per cache line.
Intern strings at ingestion; replace comparisons with integer IDs in hot paths.
No interface{}/any/Object in hot paths — fat pointer + no inlining.
Fixed-point (int32 × 1000) for bounded monetary/sensor values; avoids FPU stalls.
Pack booleans: uint8 bitfield for 8 flags vs 8 bool fields = 7 bytes saved.
Cache line = 64 bytes on x86 and most ARM/RISC-V cores (128 bytes on Apple silicon). Struct field order: largest → smallest minimizes compiler padding.
False sharing: pad independently-written variables to 64 bytes (alignas(64), #[repr(align(64))]).
SoA over AoS when a loop touches only a subset of fields — packs relevant data contiguously.
AoSoA (blocks of 4–8 = SIMD width) for SIMD-heavy math kernels.
SIMD arrays: 32-byte aligned (AVX2), 64-byte aligned (AVX-512). posix_memalign, aligned_alloc, #[repr(align(64))].
Zero-copy parsing: store (offset, length) pairs into raw buffer. Only copy when ownership transfers across threads.
O_DIRECT / mmap buffers: align to 4096 bytes.
Prefetch: __builtin_prefetch(addr, 0, 3) 100–300 ns ahead of pointer-chasing loops.
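The field-ordering rule above can be checked without a C compiler. The sketch below uses Python's ctypes, which lays structures out with the platform's native C alignment rules; the struct and field names are hypothetical and only illustrate the padding effect.

```python
import ctypes


class Unordered(ctypes.Structure):
    # small, large, small: padding is inserted after each 1-byte field
    _fields_ = [
        ("flag_a", ctypes.c_uint8),
        ("counter", ctypes.c_uint64),
        ("flag_b", ctypes.c_uint8),
    ]


class Ordered(ctypes.Structure):
    # largest first: both 1-byte fields share the tail of the struct
    _fields_ = [
        ("counter", ctypes.c_uint64),
        ("flag_a", ctypes.c_uint8),
        ("flag_b", ctypes.c_uint8),
    ]


unordered_size = ctypes.sizeof(Unordered)
ordered_size = ctypes.sizeof(Ordered)
print(unordered_size, ordered_size)  # typically 24 and 16 on x86-64 / AArch64
```

Tail padding remains in the ordered variant (10 bytes of data in a 16-byte struct), which is why ordering reduces padding rather than removing it entirely.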

Data Structures — Internals / Use When / Downsides

Hash Maps

Variant	Internals	Use When	Downside
Swiss Tables (open addr)	Flat array + 1-byte H2 metadata; SIMD `_mm_cmpeq_epi8` checks 16 slots	Short keys < 32B; Go 1.24+ built-in; C++ Abseil `flat_hash_map`	Load factor cap 87.5% (7/8); resize = O(N) copy spike
Chaining	Array of linked-list bucket heads	High load factor > 90%; variable-size keys	Pointer chase = cache miss per element; alloc-heavy
Perfect hash	Offline-computed MPH; 2–3 ns/key; zero collisions	Static key sets (HTTP headers, opcodes)	Cannot add keys post-build
Sharded map	N independent shards; `shard = hash(key) & (N-1)`	Concurrent writes; reduce lock contention by factor N	Cross-shard ops require N locks; hot key still contends

Hash functions: integer keys → wyhash/xxHash3; strings < 64B → wyhash; long strings → xxHash3 SIMD; untrusted keys → SipHash-1-3.

Arrays / Collections

Dynamic array / Vec: (ptr, len, cap) triple; O(1) append within cap; resize = 2× copy. Pre-allocate: make([]T, 0, cap) / Vec::with_capacity(n) / new ArrayList<>(n).
Ring buffer: power-of-two size; head & (size-1) for wrap. Allocation-free FIFO. Use for SPSC I/O/event pipelines.
Bag (multiset): {element → count} hash map or sorted array + binary search.
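A minimal sketch of the power-of-two ring buffer described above, assuming a single-threaded caller. The class and method names are hypothetical; a real SPSC queue would add acquire/release ordering and cache-line padding, which Python cannot express.

```python
class RingBuffer:
    """Fixed-capacity FIFO; capacity must be a power of two."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0 or capacity & (capacity - 1):
            raise ValueError("capacity must be a power of two")
        self._slots = [None] * capacity   # allocated once, reused afterwards
        self._mask = capacity - 1
        self._head = 0                    # next index to read (only grows)
        self._tail = 0                    # next index to write (only grows)

    def push(self, item) -> bool:
        if self._tail - self._head > self._mask:    # buffer is full
            return False
        self._slots[self._tail & self._mask] = item  # wrap with a mask, no modulo
        self._tail += 1
        return True

    def pop(self):
        if self._head == self._tail:                # buffer is empty
            return None
        item = self._slots[self._head & self._mask]
        self._head += 1
        return item


rb = RingBuffer(4)
accepted = [rb.push(i) for i in range(5)]   # the fifth push is rejected
first = rb.pop()
rb.push(4)
drained = [rb.pop() for _ in range(4)]
print(accepted, first, drained)
```

Returning None on an empty buffer keeps the sketch short; it also means None cannot be stored as a payload.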

Lock-Free

Structure	Internals	Use When	Downside
SPSC queue	Ring buffer; `head`/`tail` on separate cache lines; acquire/release only	One producer, one consumer; 10–30 ns/op	Strictly single-producer, single-consumer
Disruptor	Pre-allocated ring; sequence numbers; CAS claim; wait strategies (busy-spin < yield < park)	MPMC; tens of millions msg/sec; sub-100 ns	Fixed ring size; slow consumer blocks fast producer
Sharded	N independent instances; lock per shard	Concurrent maps/caches/rate limiters	Global iteration requires N locks

Memory ordering cost: x86 (TSO): acquire/release free; ARM/RISC-V: explicit barriers ~10–20 ns. Use weakest correct ordering.

Trees

Type	Internals	Use When	Downside
B+Tree	Multi-key cache-line nodes; leaves linked	Sorted in-memory maps N > 1000; most relational DB storage engines	Write amplification on node splits
ART	Node4/16/48/256 by child count; SIMD byte-compare; path compression	IP routing, prefix enum, sorted string maps	8+ bytes/node overhead; complex concurrent impl
Van Emde Boas	Recursive cache-oblivious layout; each half-height subtree stored contiguously	Static sorted tables queried many times	Read-only after construction; unfamiliar API

Other Structures

Structure	Use When	Key Detail
Bloom filter (blocked)	Pre-filter before expensive op; 10 bits/elem ≈ 1% FPR	Blocked variant: 1 cache line per query
Cuckoo filter	Bloom + deletion support	2 bucket reads; load factor > 95% risks insert failure
Skip list	Sorted concurrent access; lock-free per-node CAS	O(log N) expected search; extra forward pointers per node and worse cache locality than a B-tree
CSR graph	BFS/DFS/Dijkstra/PageRank; cache-optimal traversal	`offsets[V+1]` + `edges[E]` flat arrays
Columnar (Arrow)	Analytics: filter/sum/group touching few columns	All values of column N contiguous
Disruptor event bus	In-process pub/sub; millions events/sec	Pre-alloc ring + per-subscriber sequence counter
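The "10 bits/elem ≈ 1% FPR" figure for Bloom filters follows from the standard formula for a classic (non-blocked) filter: with m/n bits per element and k = (m/n)·ln 2 hash functions, the false-positive rate is (1 − e^(−k·n/m))^k. A small sketch of that arithmetic, with a hypothetical helper name:

```python
import math


def bloom_fpr(bits_per_element: float) -> tuple[int, float]:
    """Return (hash count k, false-positive rate) for a classic Bloom filter."""
    k = max(1, round(bits_per_element * math.log(2)))   # optimal k, rounded
    fpr = (1.0 - math.exp(-k / bits_per_element)) ** k
    return k, fpr


k, fpr = bloom_fpr(10)
print(k, f"{fpr:.4f}")  # 7 hash functions, about 0.8% false positives
```

A blocked filter confines all k probes to one cache line, which generally costs a somewhat higher false-positive rate at the same size, so treat this number as a lower bound for the blocked variant.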

Control Flow

GVN/CSE: hoist loop-invariant loads into const-locals before loop. for(int i=0; i<obj->len; i++) → const int len = obj->len; const T* d = obj->data; — one load vs N.
Loop unswitching: if a branch predicate is stable across all iterations, manually split into two loops. Each inner loop is clean and vectorizable.
Cold-path outlining: __attribute__((noinline, cold)) on all error handlers, assertions, rare branches. Keeps hot basic blocks dense in i-cache.
SROA: prefer flat local variables and structs that don't escape a function — compiler promotes them to registers.
Inlining: __attribute__((always_inline)) / #[inline(always)] for hot ≤ 20-instruction functions. #[cold] #[inline(never)] for error paths.
Loop unrolling: manual for inner loops < 8 iterations. C/C++: #pragma GCC unroll N (GCC) / #pragma unroll N (Clang). Rust has no stable unroll attribute; unroll by hand or rely on LLVM.
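Loop tiling: for 2D loops on matrices, tile to L1/L2 cache (tile edge ≈ sqrt(cache_size / element_size) elements for a single tile; divide cache_size by the number of tiles live at once, e.g. 3 for matmul).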
Avoid virtual dispatch in hot paths: interface methods prevent inlining; use concrete types + monomorphization (Rust generics, C++ templates).
Devirtualization hint: call-site type hint or PGO feedback causes compiler to inline virtual calls.
Auto-vectorization triggers: unit-stride access, no pointer aliasing (restrict/noalias), loop counter known, no function calls inside loop, no breaks/continues.
LICM: if value doesn't change per iteration, it must not be inside the loop.
SLP vectorization: same operation on independent scalar chains → compiler packs into SIMD. Enable with -O3 -march=native.
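As a worked example of the loop-tiling rule in this list, the sketch below computes a square tile edge for an assumed 32 KiB L1 data cache and 4-byte elements. The helper name and the `live_tiles` parameter are illustrative; `live_tiles=3` models a matmul that keeps tiles of A, B and C resident together.

```python
import math


def tile_dim(cache_bytes: int, element_bytes: int, live_tiles: int = 1) -> int:
    """Largest square tile edge such that `live_tiles` tiles fit in the cache."""
    return math.isqrt(cache_bytes // (element_bytes * live_tiles))


l1_bytes = 32 * 1024                         # assumed L1d size
single = tile_dim(l1_bytes, 4)               # one f32 tile
matmul = tile_dim(l1_bytes, 4, live_tiles=3) # three f32 tiles at once
print(single, matmul)  # 90 52
```

In practice the edge is usually rounded down to a multiple of the SIMD width and then tuned by benchmark, since associativity and other resident data shrink the usable cache.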

Memory Management

Allocator Config (1/4 RAM preallocated, tiny+large mix)

jemalloc 5.3.1

MALLOC_CONF="narenas:$(nproc),tcache:true,tcache_max:32768,lg_tcache_nslots_mul:2,\
background_thread:true,dirty_decay_ms:5000,muzzy_decay_ms:10000,\
oversize_threshold:8388608,retain:true,metadata_thp:auto,thp:always"

Pre-warm: mallocx(RAM/4, MALLOCX_ARENA(0)|MALLOCX_TCACHE_NONE) → memset → dallocx. With retain:true, freed extents stay in address space.

tcmalloc / gperftools 2.17

TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=$((512*1024*1024))  # 512 MB
TCMALLOC_RELEASE_RATE=0.1

Abseil TCMalloc programmatic: SetMaxTotalThreadCacheBytes(512<<20), SetBackgroundReleaseRate(0). Call MarkThreadIdle()/MarkThreadBusy() for pool threads.

mimalloc v3.3.2 (recommended default 2026)

MIMALLOC_RESERVE_HUGE_OS_PAGES=$((RAM_MB/4096))  # count of 1 GiB pages = 1/4 of RAM
MIMALLOC_ALLOW_THP=1
MIMALLOC_PAGE_RESET=0        # no page reset on free (faster short-lived alloc reuse)
MIMALLOC_DECOMMIT_DELAY=500  # 500 ms before OS decommit
MIMALLOC_ARENA_EAGER_COMMIT=1

v3 per-request heap: mi_heap_new() → mi_heap_malloc() → mi_heap_destroy() (bulk free).

Choosing: mimalloc v3 → new projects; jemalloc → lowest p99 latency + mixed tiny/large; Abseil TCMalloc → extreme multi-thread throughput; jemalloc/mimalloc → TidesDB/RocksDB (crashes with glibc).

Arena / Object Pools

Per-request scratch arena: 64 KB blocks from pool; all allocs from bump pointer; single free at request end.
Object pools: pre-create N at startup; lock-free SPSC/MPMC free list; monitor high-water mark.
Pool objects: ByteBuffers, connection objects, parser instances, ZSTD_CCtx/ZSTD_DCtx.
Off-heap for JVM: MemorySegment (JDK 22+ Panama, stable); Arena.ofConfined() for scoped lifecycle.

GC — JVM (Java 25/26)

G1GC (default, general purpose)

-XX:+UseG1GC -Xms<size> -Xmx<size> -XX:MaxGCPauseMillis=100   # set -Xms equal to -Xmx
-XX:G1HeapRegionSize=16m        # heap/2048; power of 2; 1–32 MB
-XX:G1NewSizePercent=20 -XX:G1MaxNewSizePercent=40
-XX:InitiatingHeapOccupancyPercent=40  # start concurrent mark earlier
-XX:G1ReservePercent=15
-XX:+G1UseAdaptiveIHOP
-XX:+AlwaysPreTouch -XX:+UseNUMA -XX:+UseTransparentHugePages

Java 25: remembered set merge → 2 GB→0.75 GB on 64 GB heap. Java 26 JEP 522: 15% throughput gain from reduced sync overhead.
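The `G1HeapRegionSize` comment above (heap/2048, power of two, 1–32 MB) can be turned into a small calculation. This is a sketch of that rule of thumb only, not of the JVM's actual ergonomics code; the function name is hypothetical.

```python
def g1_region_size_mb(heap_mb: int, target_regions: int = 2048) -> int:
    """Largest power of two <= heap/target_regions, clamped to 1..32 MB."""
    raw = max(1, heap_mb // target_regions)
    size = 1 << (raw.bit_length() - 1)   # round down to a power of two
    return min(32, size)


sizes = {heap_gb: g1_region_size_mb(heap_gb * 1024) for heap_gb in (4, 32, 128)}
print(sizes)  # {4: 2, 32: 16, 128: 32}
```

A 32 GB heap lands on the 16 MB value used in the flag list above.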

ZGC (Java 25 = Generational only; -XX:+ZGenerational is no-op now)

-XX:+UseZGC -Xms<size> -Xmx<size>   # set -Xms equal to -Xmx
-XX:SoftMaxHeapSize=<0.75*Xmx>  # normal-load cap; burst to Xmx
-XX:ZAllocationSpikeTolerance=2
-XX:ZUncommitDelay=300
-XX:ConcGCThreads=4
-XX:AOTCache=app.aot             # Java 26+: ZGC + AOT startup compatible
-XX:+AlwaysPreTouch -XX:+UseNUMA -XX:+UseTransparentHugePages

Java 25 Mapped Cache replaces the Page Cache: fixes inflated RSS and reduces fragmentation. ZGC in general: ~5–10% throughput overhead vs G1 and ~10–15% more total memory.

Decision: G1 for REST APIs / web services (P99 < 100 ms); ZGC for latency-critical (P99 < 10 ms) or caches (16–200 GB); G1 relaxed for batch/ETL; ZGC for HFT + NUMA pin. Shenandoah and Parallel GC: still in JDK but not recommended for new deployments.

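JFR on demand: jcmd <pid> JFR.start duration=60s filename=app.jfr settings=profile. For always-on recording, omit duration and bound the recording with maxage/maxsize. Java 25 JEP 509 (experimental, Linux): CPU-time profiling via the jdk.CPUTimeSample event, e.g. -XX:StartFlightRecording=jdk.CPUTimeSample#enabled=true.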

GC — Node.js / V8 (Node 24 LTS)

# Heap structure: New Space (Nursery + To-Space) | Old Space | Code Space | Large Object
# --max-old-space-size limits ONLY JS heap, not Buffers/native/Code Space → RSS always higher

# High-throughput API (many short-lived objects):
node --max-old-space-size=2048 --max-semi-space-size=128 --max-code-cache-size=256 app.js

# Containerized (512 MB pod):
node --max-old-space-size-percentage=70 app.js   # Node 22+; adapts to any container size

# Data processing (large objects, small churn):
node --max-old-space-size=8192 --max-semi-space-size=32 app.js

Semi-space sizing: semi_space ≈ (expected_alloc_per_request × 2 × concurrency) / 4. Larger semi-space → fewer Scavenges → less premature promotion. Monitor: v8.getHeapStatistics().used_heap_size / heap_size_limit > 0.85 → OOM risk.
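A worked instance of the sizing rule and the 0.85 threshold above, written as plain arithmetic. The per-request allocation (256 KiB) and concurrency (1024) are assumed inputs for illustration, and the function names are hypothetical.

```python
def semi_space_mib(alloc_per_request_bytes: int, concurrency: int) -> float:
    """Rule of thumb: (alloc_per_request * 2 * concurrency) / 4, in MiB."""
    return alloc_per_request_bytes * 2 * concurrency / 4 / (1024 * 1024)


def heap_pressure(used_heap_size: int, heap_size_limit: int) -> bool:
    """True when used/limit crosses the 0.85 OOM-risk threshold."""
    return used_heap_size / heap_size_limit > 0.85


suggested = semi_space_mib(256 * 1024, 1024)
print(suggested)                    # 128.0
print(heap_pressure(1800, 2048))    # True: about 0.88 of the limit
```

With these assumed inputs the rule yields the 128 MB semi-space used in the high-throughput command line above.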

I/O

io_uring (Linux 5.1+; liburing 2.9; PG18 `io_method=io_uring`)

⚠ Blocked by default in Docker ≥ 25 / containerd ≥ 2.0 seccomp + Google fleet. Ship an epoll fallback; in containers allowlist io_uring_setup/enter/register in a custom seccomp profile (never seccomp=unconfined).

Non-circular queue (Linux 7.0): better cache perf for IOPOLL; fixes mixed-device completion deferral.
cBPF filters (Linux 7.0): per-ring operation allow/deny in containers.
IORING_SETUP_SQPOLL: kernel polling thread → zero syscalls at sustained I/O.
Multishot accept/recv: IORING_OP_ACCEPT with IORING_ACCEPT_MULTISHOT (5.19+) / IORING_OP_RECV with IORING_RECV_MULTISHOT (6.0+): one SQE handles all connections/data arrivals.
IORING_OP_SEND_ZC (6.0+): zero-copy send; app notified when buffer safe to reuse.
io_uring_register_buf_ring() (5.19+): pre-registered buffers; kernel selects free buffer per recv.
FUSE over io_uring (6.14+, carried on IORING_OP_URING_CMD; there is no dedicated FUSE opcode): 20–40% FUSE latency reduction.
Ring resize (6.13+): io_uring_resize_rings() without reconnecting.

Libraries: liburing (C), tokio-uring 0.5/monoio 0.2 (Rust), Netty IOUring (Java), iouring-go (Go).

Buffer Management

Read buffer size: ≥ 1 full application frame (HTTP/2: 16 KB; binary: max PDU size).
Write buffering: TCP_CORK / MSG_MORE to hold until batch full. Then flush in one sendmsg.
Page-cache bypass: O_DIRECT for database-type I/O (aligned 4096, size multiple of block size).
mmap for large read-only datasets: OS page cache manages eviction; no read() syscalls.
MADV_WILLNEED: prefault pages before hot path. MADV_DONTNEED to release without freeing VA.

UDP GSO/GRO (Linux 4.18+ GSO; 5.0+ GRO; 6.2+ virtio TUN)

GSO: batch N datagrams into one super-packet; UDP_SEGMENT cmsg with per-segment size. One sendmsg() → N wire datagrams. Requires tx-udp-segmentation + tx-checksum-ip-generic in ethtool.
GRO: int on = 1; setsockopt(fd, SOL_UDP, UDP_GRO, &on, sizeof(on)); recvmsg returns a coalesced super-packet plus a UDP_GRO cmsg carrying the segment size.
Fallback: sendmmsg/recvmmsg for environments without hardware GSO — still batches syscalls.
Result: Tailscale 10 Gb/s sustained (was 1–3 Gb/s pre-GSO/GRO). QUIC/HTTP3 uses same path.

Concurrency

SPSC: ring buffer, power-of-two size, head/tail on separate 64-byte cache lines, acquire/release only. 10–30 ns/op.
Disruptor: pre-allocated ring, sequence numbers, CAS claim, busy-spin wait. Sub-100 ns; tens of millions msg/s.
Sharding: N independent structures, shard = hash(key) & (N-1), which requires N to be a power of two. Reduces contention by up to N. N ≈ CPU cores × 4, rounded up to a power of two.
Atomic ordering (cheapest → costliest): relaxed → acquire/release → seq_cst. x86: acquire/release free (TSO); ARM: barriers ~10–20 ns.
False sharing: pad independently-written atomics/counters to 64 bytes.
Thread pool: size = CPU count for CPU-bound; higher for I/O-bound (benchmark). Avoid ThreadPerTask.
Work stealing: ForkJoinPool (Java), Tokio scheduler (Rust), Go runtime (M:N). Minimizes thread idle time.
rseq (restartable sequences, glibc 2.35+): per-CPU counters/freelists at plain load/store speed — no atomics, ever. Kernel restarts the section on preemption/migration. Abseil TCMalloc per-CPU mode is built on it.
NUMA: allocate memory on same NUMA node as threads that use it. numactl --membind=N --cpunodebind=N. GOMAXPROCS reads cgroup CPU quota in Go 1.25+.
Lock-free invariant: load(acquire) pairs with store(release). CAS loops: exponential backoff or yield before retry.
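The sharding rule in this list can be sketched as a lock-per-shard counter map. This is an illustration under assumptions: the class name and API are hypothetical, and Python locks stand in for whatever per-shard synchronization the target language offers.

```python
import threading


class ShardedCounter:
    """Per-key counters spread over N independently locked shards."""

    def __init__(self, cpu_cores: int, factor: int = 4) -> None:
        n = 1
        while n < cpu_cores * factor:   # round up to a power of two
            n <<= 1
        self.shard_count = n
        self._mask = n - 1              # valid only because n is a power of two
        self._locks = [threading.Lock() for _ in range(n)]
        self._maps = [{} for _ in range(n)]

    def incr(self, key, delta: int = 1) -> None:
        i = hash(key) & self._mask      # shard = hash(key) & (N-1)
        with self._locks[i]:
            self._maps[i][key] = self._maps[i].get(key, 0) + delta

    def get(self, key) -> int:
        i = hash(key) & self._mask
        with self._locks[i]:
            return self._maps[i].get(key, 0)

    def total(self) -> int:
        # A global view has to visit every shard, one lock at a time.
        result = 0
        for lock, shard in zip(self._locks, self._maps):
            with lock:
                result += sum(shard.values())
        return result


counter = ShardedCounter(cpu_cores=6)   # 6 * 4 = 24, rounded up to 32 shards


def worker() -> None:
    for i in range(1000):
        counter.incr(f"key-{i % 50}")


threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter.shard_count, counter.total())  # 32 8000
```

A single hot key still maps to one shard, so sharding spreads contention across keys but does not help when one key dominates.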

Networking & IPC

TCP/Socket Options

TCP_NODELAY=1        // disable Nagle; critical for request-response
SO_KEEPALIVE=1       // detect dead peers
SO_REUSEADDR=1       // fast restart
SO_REUSEPORT=1       // parallel accept across threads
SO_RCVBUF=4*1024*1024
SO_SNDBUF=4*1024*1024
SO_BUSY_POLL=50      // microseconds; avoids sleep/wakeup; uses CPU

Linux TCP: net.ipv4.tcp_slow_start_after_idle=0, net.ipv4.tcp_congestion_control=bbr, net.core.netdev_max_backlog=65536, net.ipv4.ip_local_port_range="1024 65535".

TLS 1.3

1-RTT PSK resumption: server sends NewSessionTicket after handshake; client re-sends PSK in next ClientHello. Go: tls.Config.SetSessionTicketKeys() for distributed KEK sharing. Nginx: ssl_session_cache shared:SSL:50m; ssl_session_timeout 1d.
0-RTT early data: client sends data with ClientHello — zero RTT. Use ONLY for idempotent ops (GET, reads). Anti-replay: Redis SET NX per ticket + ±5s time window.
Distributed KEK: rotate every 24h, accept old for 48h. Store in Vault/Secrets Manager. Ticket lifetime: 24h default, 1h for high-security APIs.
OCSP stapling: eliminates 100–500 ms CA round-trip on new sessions. Nginx: ssl_stapling on; ssl_stapling_verify on;.
40% of Cloudflare TLS sessions are resumptions.

Zstd Custom Dictionary

zstd --train samples/* -o dict.zstd --maxdict=114688   # 112 KB dict

Hot path: ZSTD_createCDict_byReference() once; ZSTD_compress_usingCDict(cctx, ...) per call. No alloc. Decompression mirrors this with ZSTD_DDict + ZSTD_decompress_usingDDict. Zstd 1.5.7 (Feb 2025): 4.9× faster dict ops. Gains: 30–40× on API responses; 60–90× on similar-structure payloads (Roblox feature flags).

Logging, Metrics & Observability

Async thread-local logging: thread-local ring buffer → MPSC queue → single background writer. No sprintf in hot path; pre-format to fixed struct; format in background.
Level check: single atomic load before any formatting. DEBUG = no-op in production.
Rate limiting: emit first occurrence; suppress + summary every 1 s.
Metrics: per-CPU-core counters (no synchronization); aggregate at scrape time. HDR Histogram for latency; Prometheus format.
Tracing: W3C TraceContext (traceparent); 1% head sampling or tail-based (bias slow/error). OpenTelemetry → Jaeger/Tempo. Span overhead: skip per-iteration spans.
Continuous profiling: eBPF (Parca, Pyroscope); zero overhead when idle.
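The rate-limiting rule in this list (emit the first occurrence, then suppress and summarize per 1 s window) can be sketched as follows. The class name, the key-based grouping and the injectable clock are assumptions made for illustration; the summary is emitted lazily, when the next message for that key arrives after the window has passed.

```python
import time


class RateLimitedLog:
    """Emit the first message per key, then one summary per window."""

    def __init__(self, sink, window_s: float = 1.0, clock=time.monotonic) -> None:
        self._sink = sink
        self._window = window_s
        self._clock = clock
        self._state = {}   # key -> [window_start, suppressed_count]

    def log(self, key: str, message: str) -> None:
        now = self._clock()
        entry = self._state.get(key)
        if entry is None:                        # first occurrence: always emit
            self._state[key] = [now, 0]
            self._sink(message)
        elif now - entry[0] >= self._window:     # window elapsed: summary + emit
            if entry[1]:
                self._sink(f"{key}: suppressed {entry[1]} similar messages")
            entry[0], entry[1] = now, 0
            self._sink(message)
        else:                                    # inside the window: count only
            entry[1] += 1


lines = []
fake_now = [0.0]                                 # controllable clock for the demo
limiter = RateLimitedLog(lines.append, window_s=1.0, clock=lambda: fake_now[0])
for _ in range(3):
    limiter.log("db", "db timeout")
fake_now[0] = 1.5
limiter.log("db", "db timeout")
print(lines)
```

A background flush would be needed if summaries must appear even when a key goes quiet; this sketch leaves that out.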

Architecture Principles

Monolith over microservices: in-process call < 10 ns vs network call 500 µs–5 ms. Shared memory 1000× faster than REST for same-host data.
No runtime reflection/DI in hot path: Spring/Guice startup cost, proxy overhead, megamorphic call sites. Use compile-time codegen (Quarkus, Micronaut, Dagger) or manual wiring.
Avoid dynamic dispatch in hot loops: virtual calls prevent inlining; cause megamorphic sites (> 3 types = 5–20× slower). Use concrete types + generics/templates (Rust impl Trait, C++ templates).
Codegen at build time: Protobuf/FlatBuffers, JOOQ/SQLc, ANTLR. Generated code is inlineable, statically typed, optimizable.
No DI containers in request path: context object pattern — pass RequestContext struct through call chain.

FFI & Native Interop

General Rules

Batch work across FFI boundary: one call with large array vs N calls × 1 element.
CGO cost: ~60–200 ns per call (goroutine stack switch). Batch.
Pass pre-allocated output buffers — avoid native code allocating and returning heap ptrs.
Off-heap exchange medium: no GC pinning needed.

Polars Shim (Universal SIMD Backend)

Pattern: Rust cdylib shim (polars crate) → extern "C" opaque handles (u64) → any language via its FFI. Arrow C Data Interface (ArrowSchema/ArrowArray structs) for zero-copy data transfer.

Key: polars_df_execute_plan(json_plan) — ship entire lazy query plan as JSON string; single FFI call. Polars does predicate pushdown + parallel execution.

Callers: Bun.js via bun:ffi (dlopen, FFIType.u64, ptr(TypedArray) = zero copy); Go via CGO.

ArrayFire (Universal GPU/SIMD Backend)

Access methods:

Official language binding (pip install arrayfire, npm install arrayfire-js, cargo add arrayfire).
Direct C ABI (libaf handles) via any FFI.
Custom Rust cdylib shim for domain-fused ops (preserves JIT kernel fusion across FFI boundary).

Critical: chain ops inside shim to preserve JIT fusion. af_shim_matmul_sigmoid() = one GPU kernel; calling matmul then sigmoid separately = two kernels + VRAM roundtrip.

Backend order: CUDA → OpenCL → CPU (BLAS+AVX-512). set_backend(Backend::DEFAULT).

Hardware & SIMD

CPU Microarchitecture (2026)

Property	Key	Code implication
OoOE (ROB 200–600)	Decouples issue from execute	Expose independent ops; use multiple accumulators
Superscalar (4–8 ports/cycle)	Mix port types in inner loops	Interleave load/store/ALU to saturate all ports
Branch predictor (TAGE)	> 99% accuracy regular patterns	`cmov`/branchless for per-element data; arrange loop exits as "not taken"
Store-forwarding	Same-address load after store: ~5 cycles	Fails (10–15 cycle stall) on size/alignment mismatch
MLP (10–20 outstanding MSHR)	Sequential access saturates; pointer-chase serializes	Prefer flat arrays; issue `__builtin_prefetch` 200–400 ns ahead
I-cache / µop cache (1500–4000 µops)	Hot loops must fit	`#[cold]` / `__attribute__((cold))` for error paths

2026 silicon: Zen 5 = full 512-bit AVX-512 datapath (4 native units; 40–50% uplift vs Zen 4). Intel Panther Lake = Intel 18A (GAA transistors + BSPDN), mobile-first, Q1 2026. AMD Zen 6 (EPYC Venice H2 2026) = TSMC 2nm N2P, AVX10.2, FP8.

SIMD ISA (2026)

ISA	Width	CPUs	Key Ops
AVX2	256-bit	All modern x86	8×f32/int; baseline SIMD target
AVX-512	512-bit	Zen 4/5, Xeon 6	16×f32; mask registers; compress/expand
AVX10.1	512-bit	Xeon 6 (Granite Rapids)	Unified ISA: all AVX-512 sub-ext in one CPUID bit
AVX10.2	512-bit	Xeon Diamond Rapids 2026, Zen 6	FP16/BF16 scalar+vector; OCP FP8 (E4M3/E5M2); IEEE NaN semantics
ARM SVE2	128–2048-bit (runtime)	Cortex-X4, Graviton 4	Variable-width; one binary for all widths
ARM SME	—	Cortex-X925, Apple M4	Matrix outer product; standardized AMX equivalent
Intel AMX	8 tiles (16×64 B each)	Xeon 4/5/6 only	1024 MAC/instruction; ~512 INT8 TOPS/socket

Google Highway 1.2 (github.com/google/highway): one source → AVX2/AVX-512/AVX10/NEON/SVE2/WASM. HWY_DYNAMIC_DISPATCH. vqsort = 3–8× vs std::sort. Preferred over raw intrinsics.

AVX10.2 detection: CPUID leaf 24H, EAX=24H, ECX=0H → EBX[7:0] >= 2.

LLM Inference Serving

Decoding is memory-bandwidth-bound: tokens/sec ≈ mem_BW / bytes_per_token. 70B FP16 (140 GB) on 3.35 TB/s ≈ 24 tok/s single-stream ceiling; 4-bit (35 GB) ≈ 96 tok/s. This formula explains most inference behavior.
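The bandwidth formula above is easy to reproduce. The sketch assumes that every decoded token reads all model weights once and ignores KV-cache traffic and batching; the function name is hypothetical.

```python
def decode_ceiling_tok_s(params_billion: float, bits_per_weight: int,
                         mem_bw_tb_s: float) -> float:
    """Single-stream upper bound: memory bandwidth / bytes read per token."""
    weight_gb = params_billion * bits_per_weight / 8   # 70B at FP16 = 140 GB
    return mem_bw_tb_s * 1000 / weight_gb


fp16 = decode_ceiling_tok_s(70, 16, 3.35)
int4 = decode_ceiling_tok_s(70, 4, 3.35)
print(round(fp16), round(int4))  # 24 96
```

Batching changes the picture because one pass over the weights then serves many sequences, which is why the techniques in the table below focus on batch size and KV-cache efficiency.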

Technique	Gain	Tool
Continuous batching (evict/admit per decode step)	5–20× vs static batching	vLLM, TGI, TensorRT-LLM
PagedAttention (KV cache in OS-style pages)	Recovers 60–80% fragmented KV memory	vLLM
Prefix caching / RadixAttention (shared system prompts)	Large TTFT cut at high QPS	SGLang
Speculative decoding (draft k tokens, verify in 1 pass)	2–3× decode, identical output dist	EAGLE, Medusa
KV cache FP8/INT8	2× max batch size	Hopper+/AVX10.2 FP8
Weight 4-bit (AWQ/GPTQ/FP4)	~4× decode throughput, ≤1% quality loss	AWQ
Chunked prefill	Long prompt no longer stalls decode latency	vLLM/SGLang
CUDA Graphs decode capture	Removes ~5–10 µs × N kernel-launch overhead/token	TensorRT-LLM

TP (tensor parallel) within NVLink node; PP (pipeline) across nodes. Stacks: vLLM (default), SGLang (shared-prefix QPS), TensorRT-LLM (peak NVIDIA), llama.cpp (CPU/edge).

Protocol Parsing (Zero-Allocation State Machine)

Parser state = integer cursor variables in CPU registers only. No heap alloc in hot path.
Store (start: uint16, len: uint16) offsets into raw receive buffer — not copies.
ErrIncomplete: return without consuming bytes; caller accumulates in ring buffer + retries.
Bulk delimiter scan: bytes.IndexByte (Go) / memchr (C/Rust) — SIMD, skips 16–32 bytes/cycle vs 1.
No fmt.Sprintf, regex.Match, json.Unmarshal inside state machine. Identify boundaries first; validate semantics in app layer.
Pre-allocate output struct from pool; fill in-place; single-pass left-to-right; no backtracking.
Tiered codec pipeline: receive buffer → AES-NI decrypt → LZ4/Zstd decompress → state machine parser → frame struct. All buffers pre-allocated in graduated pools (4KB/16KB/64KB/256KB).
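The offset-based parsing rules above can be sketched for a newline-delimited protocol. Names are hypothetical, and the Python list of tuples is only for readability; a real hot path would write offsets into a pre-allocated array instead.

```python
from typing import List, Tuple


def scan_frames(buf: bytes, start: int = 0,
                delim: bytes = b"\n") -> Tuple[List[Tuple[int, int]], int]:
    """Return ([(offset, length), ...], consumed) for complete frames in buf."""
    frames = []
    pos = start
    while True:
        end = buf.find(delim, pos)   # bulk delimiter scan in native code
        if end < 0:                  # incomplete frame: leave it unconsumed
            break
        frames.append((pos, end - pos))
        pos = end + 1
    return frames, pos


buf = b"GET a\nSET b 1\nGE"          # last frame is still arriving
frames, consumed = scan_frames(buf)
view = memoryview(buf)
offset, length = frames[0]
first = view[offset:offset + length]  # a slice of the buffer, not a copy
print(frames, consumed, bytes(first))
```

The caller keeps `buf[consumed:]` (here `b"GE"`) in its receive buffer and calls `scan_frames` again once more bytes arrive, which mirrors the ErrIncomplete rule.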

Database

PostgreSQL 18.4 (current) / PG19 Beta 1 (June 4, 2026)

PG18 AIO (headline feature):

io_method = worker            # default; all platforms
# io_method = io_uring        # Linux 5.1+; requires --with-liburing; fastest
io_workers = 8                # ~25% of CPU cores
effective_io_concurrency = 200   # NVMe; 20–50 SATA SSD; 1–4 HDD
maintenance_io_concurrency = 64
io_combine_limit = 128         # in 8 kB blocks (= 1 MB per I/O); capped by io_max_combine_limit, so raise that too

Verify liburing: SELECT setting FROM pg_config() WHERE name='CONFIGURE' → grep liburing.

Other PG18: skip scan B-tree (multi-column index without leading eq predicate), parallel GIN builds, statistics preserved across pg_upgrade, uuidv7() (monotonic UUIDs, fewer B-tree splits), virtual generated columns, wire protocol v3.2.

PG19 Beta: SQL/PGQ native graph queries, ON CONFLICT DO SELECT (atomic get-or-create), parallel autovacuum, pg_plan_advice hints framework, online REPACK, 64-bit MultiXact, JIT off by default, WAIT FOR LSN on replicas.

Production postgresql.conf:

shared_buffers = 8GB               # 25% RAM
work_mem = 64MB                    # per sort/hash op (careful: × max_connections × 4)
maintenance_work_mem = 1GB         # VACUUM, CREATE INDEX
wal_compression = zstd             # 40–60% WAL reduction
max_wal_size = 8GB
checkpoint_completion_target = 0.9
synchronous_commit = off           # async commit; a crash can lose up to ~3 × wal_writer_delay (default 200 ms) of commits, without corruption
huge_pages = try
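The `work_mem` warning above is worth running as arithmetic before picking a value. The sketch assumes `max_connections = 200` and four sort/hash nodes per query, both illustrative numbers rather than settings taken from this configuration.

```python
def worst_case_work_mem_gb(work_mem_mb: int, max_connections: int,
                           ops_per_query: int = 4) -> float:
    """Upper bound if every connection runs ops_per_query sorts/hashes at once."""
    return work_mem_mb * max_connections * ops_per_query / 1024


worst = worst_case_work_mem_gb(64, 200)
print(worst)  # 50.0
```

With `shared_buffers = 8GB` at 25% of RAM the host has about 32 GB, so this worst case would exceed physical memory; a connection pooler or a lower `work_mem` keeps the bound realistic.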

PostgreSQL Indexes

Type	Use When	Downside
B-Tree	Most scalar columns; `<`, `=`, `LIKE 'x%'`, BETWEEN	Write amplification on splits
Partial	`WHERE active=true`; low-cardinality flag — 5–50× smaller	Query WHERE must syntactically imply index predicate (form must match)
Expression	`lower(email)`, `date_trunc(...)`	Applies only to matching expression
GIN	Full-text, JSONB `@>`, arrays `&&`; tsvector	Expensive updates; enable `fastupdate=on`; VACUUM pressure
GIN parallel	PG18: `CREATE INDEX USING gin(...)` uses parallel workers	—

Partition pruning works only when: partition key in WHERE, constant or bound param, enable_partition_pruning=on. Function call on partition key (date_trunc(...)) prevents pruning. Use pg_partman for auto-creation.

Partition strategies: RANGE (time-series), LIST (region/enum), HASH (uniform distribution). Each partition gets its own index — smaller and faster to maintain.

TimescaleDB

CREATE EXTENSION IF NOT EXISTS timescaledb;
SELECT create_hypertable('metrics', 'time', chunk_time_interval => INTERVAL '1 day');
-- Compression: 90–98% space saving on cold chunks
ALTER TABLE metrics SET (timescaledb.compress, timescaledb.compress_segmentby='device_id',
  timescaledb.compress_orderby='time DESC');
SELECT add_compression_policy('metrics', INTERVAL '7 days');
-- Continuous aggregate with auto-refresh
CREATE MATERIALIZED VIEW metrics_1h WITH (timescaledb.continuous) AS
  SELECT time_bucket('1h', time) AS bucket, device_id, avg(cpu) FROM metrics GROUP BY 1,2;
SELECT add_continuous_aggregate_policy('metrics_1h',
  start_offset => INTERVAL '3 hours', end_offset => INTERVAL '1 hour',
  schedule_interval => INTERVAL '1 hour');
SELECT add_retention_policy('metrics', INTERVAL '90 days');

pgvector

CREATE EXTENSION vector;
-- HNSW (preferred): graph-based ANN, ~95% recall
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops)
  WITH (m=16, ef_construction=64);
SET hnsw.ef_search = 200;     -- nodes explored per query; higher = better recall
-- IVFFlat alternative: smaller index, ~90% recall (opclass must match the query operator)
CREATE INDEX ON documents USING ivfflat (embedding vector_cosine_ops) WITH (lists=100);
SET ivfflat.probes = 10;
-- Nearest-neighbor query
SELECT id FROM documents ORDER BY embedding <=> $1 LIMIT 10;
-- <=> cosine distance; <-> L2; <#> inner product

After failover: SELECT pg_prewarm('idx_embedding'). Per-partition HNSW for multi-tenant. work_mem=256MB, max_parallel_workers_per_gather=4 for heavy ANN loads.

BM25 Full-Text Search

pg_textsearch (Timescale, v1.0 March 2026, PostgreSQL-native pages, Block-Max WAND): CREATE INDEX USING bm25ts (content). Query: ORDER BY content <@> 'query' DESC. 2.4–6.5× faster than ParadeDB for 2–4 term queries at 138M docs; 8.7× higher concurrent throughput.

pg_search (ParadeDB v0.22.5, Rust/Tantivy, AGPL): CREATE INDEX USING bm25(id, desc, category) WITH (key_field='id'). Query: WHERE tbl @@@ paradedb.parse('desc:fast'). Supports faceted search, fuzzy (~2), highlight, hybrid BM25+pgvector.

Choose pg_textsearch for pure throughput/concurrency; pg_search for full Elasticsearch-style features.

Apache AGE (Graph Extension, PG 11–18)

CREATE EXTENSION IF NOT EXISTS age;
LOAD 'age'; SET search_path = ag_catalog, "$user", public;
SELECT create_graph('g');
SELECT * FROM cypher('g', $$ CREATE (:Person {name:'Alice'})-[:KNOWS]->(:Person {name:'Bob'}) $$) AS (v agtype);
SELECT * FROM cypher('g', $$ MATCH (a:Person)-[:KNOWS*1..3]->(b) RETURN b.name $$) AS (name agtype);

Hybrid SQL+Cypher: join cypher result with relational table. GIN index on properties column for agtype filter performance.

PostgreSQL 18 Primary–Replica Replication (Complete Setup)

# 1) PRIMARY postgresql.conf
wal_level = replica
max_wal_senders = 10
wal_keep_size = 1024                 # MB
wal_compression = zstd
synchronous_commit = on
hot_standby = on
max_slot_wal_keep_size = 4096        # cap WAL retention if replica goes offline (MB)
io_method = worker
io_workers = 8                       # PG18 AIO worker pool

# 2) PRIMARY: replication user + pg_hba.conf
psql -c "CREATE ROLE replicator WITH REPLICATION LOGIN ENCRYPTED PASSWORD 'pass';"
echo "host replication replicator 192.168.1.11/32 scram-sha-256" >> pg_hba.conf
psql -c "SELECT pg_create_physical_replication_slot('replica1_slot');"
systemctl reload postgresql@18-main

# 3) REPLICA: clone (writes standby.signal + primary_conninfo automatically)
systemctl stop postgresql@18-main && rm -rf /var/lib/postgresql/18/main
pg_basebackup --host=192.168.1.10 --username=replicator \
  --pgdata=/var/lib/postgresql/18/main --slot=replica1_slot \
  --wal-method=stream --write-recovery-conf --checkpoint=fast --progress

# 4) REPLICA postgresql.conf additions
hot_standby = on
hot_standby_feedback = on            # feedback stops primary VACUUM removing rows replica needs
max_standby_streaming_delay = 30s
io_method = worker
io_workers = 4
systemctl start postgresql@18-main

-- 5) Verify — on PRIMARY:
SELECT client_addr, state, sync_state,
       pg_size_pretty(pg_wal_lsn_diff(sent_lsn, replay_lsn)) AS lag
FROM pg_stat_replication;        -- expect state=streaming, lag=0 bytes
-- on REPLICA:
SELECT pg_is_in_recovery();      -- true
SELECT now() - pg_last_xact_replay_timestamp() AS delay;
-- Slot lag (growing = lagging replica blocks WAL cleanup):
SELECT slot_name, active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained
FROM pg_replication_slots;

Sync (zero data loss): synchronous_commit=remote_apply; synchronous_standby_names='FIRST 1 (replica1)', where replica1 must match application_name in the replica's primary_conninfo. Adds 1 RTT per commit; remote_write lowers latency but only survives a standby PostgreSQL crash, not a standby OS crash.
Logical (cross-version, per-table): CREATE PUBLICATION p FOR TABLE t1,t2; → CREATE SUBSCRIPTION s CONNECTION '...' PUBLICATION p WITH (streaming=on);
Failover: SELECT pg_promote(); on the replica → repoint other replicas' primary_conninfo. Use Patroni for automation; PgBouncer in front of each tier.

MariaDB 12.3 LTS (May 29, 2026; supported June 2029)

InnoDB-Based Binary Log (headline feature):

log-bin
binlog-storage-engine = innodb   # stores binlog in InnoDB tablespace
                                  # eliminates 2PC; halves fsyncs; 4× write throughput
                                  # 2.4× faster single-thread; 50% commit latency reduction
                                  # crash-safe without sync overhead

# Safe with InnoDB binlog (was dangerous with file-based binlog):
innodb_flush_log_at_trx_commit = 2
sync_binlog = 0

Full my.cnf key params:

server_id = 1
binlog-storage-engine = innodb
gtid_domain_id = 1
gtid_strict_mode = ON
binlog_format = ROW
binlog_row_image = MINIMAL              # 50–80% smaller binlog
innodb_buffer_pool_size = 12G           # 70–80% RAM
# innodb_buffer_pool_instances: removed in MariaDB 10.6 (single buffer pool); do not set
innodb_flush_method = O_DIRECT
innodb_io_capacity = 2000               # NVMe: 8000–20000
innodb_log_file_size = 4G               # MariaDB name; innodb_redo_log_capacity is MySQL-only
thread_handling = pool-of-threads
thread_pool_size = 32
query_cache_type = 0
query_cache_size = 0                    # keep the query cache off

Other 12.3 features: native vector search in storage, JOIN_FIXED_ORDER, MAX_EXECUTION_TIME optimizer hints, caching_sha2_password (MySQL 8 compat), XML type, aria_pagecache_segments (1–128 for parallel Aria).

MariaDB 12.3 GTID Replication (Complete Setup)

# 1) PRIMARY /etc/mysql/mariadb.conf.d/50-server.cnf
[mariadb]
server_id=1
log-bin
log_basename=mariadb-primary
binlog-storage-engine=innodb     # 12.3: crash-safe binlog, no 2PC, 4× write throughput
gtid_domain_id=1
gtid_strict_mode=ON
binlog_format=ROW
binlog_row_image=MINIMAL          # 50–80% smaller binlog
sync_binlog=0                     # safe WITH InnoDB binlog (redo log covers it)
innodb_flush_log_at_trx_commit=2  # safe WITH InnoDB binlog
expire_logs_days=7

# 2) REPLICA config
[mariadb]
server_id=2                       # must be unique
log-bin
log_basename=mariadb-replica1
binlog-storage-engine=innodb
gtid_domain_id=1
gtid_strict_mode=ON
log_slave_updates=ON              # needed for chained replicas
read_only=ON
super_read_only=ON                # block even SUPER users from writing
relay_log=/var/log/mysql/relay-bin
relay_log_purge=ON

-- 3) PRIMARY: replication user
CREATE USER 'replicator'@'192.168.1.%' IDENTIFIED BY 'pass';
GRANT REPLICATION SLAVE ON *.* TO 'replicator'@'192.168.1.%';

# 4) Backup primary (hot, no locks) → restore on replica
mariabackup --backup --target-dir=/backup/base --user=root --password=...
mariabackup --prepare --target-dir=/backup/base
cat /backup/base/mariadb_backup_binlog_info     # binlog file, position, GTID, e.g. 1-1-14728 (named xtrabackup_binlog_info before 11.1)
# on replica: stop mariadb; rm -rf /var/lib/mysql/*; mariabackup --copy-back ...; chown -R mysql:mysql

-- 5) REPLICA: point at primary using GTID and start
SET GLOBAL gtid_slave_pos = '1-1-14728';   -- domain-server-sequence; domain 1 matches gtid_domain_id=1
CHANGE MASTER TO MASTER_HOST='192.168.1.10', MASTER_USER='replicator',
  MASTER_PASSWORD='pass', MASTER_USE_GTID=slave_pos;
START SLAVE;
SHOW SLAVE STATUS\G   -- Slave_IO_Running=Yes, Slave_SQL_Running=Yes, Seconds_Behind_Master=0

-- 6) Failover: on promoted replica
STOP SLAVE; RESET SLAVE ALL;
SET GLOBAL read_only=OFF;   -- MariaDB has no super_read_only variable to reset
-- repoint others: CHANGE MASTER TO MASTER_HOST='new', MASTER_USE_GTID=slave_pos; START SLAVE;   (current_pos is deprecated since 10.10)
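
Before promoting a replica in step 6, it helps to confirm that it has applied everything the old primary logged. The following is a minimal sketch, assuming both positions are the comma-separated `domain-server-sequence` strings returned by `@@gtid_binlog_pos` and `@@gtid_slave_pos`; the function names are illustrative and not part of MariaDB.

```python
def parse_gtid_pos(pos: str) -> dict[int, tuple[int, int]]:
    """Parse '1-1-14728,2-5-90' into {domain_id: (server_id, seq_no)}."""
    result = {}
    for item in filter(None, (p.strip() for p in pos.split(","))):
        domain, server, seq = (int(x) for x in item.split("-"))
        result[domain] = (server, seq)
    return result


def gtid_lag(primary_pos: str, replica_pos: str) -> dict[int, int]:
    """Per-domain count of sequence numbers the replica still has to apply."""
    primary = parse_gtid_pos(primary_pos)
    replica = parse_gtid_pos(replica_pos)
    lag = {}
    for domain, (_server, seq) in primary.items():
        # A domain the replica has never seen counts as applied sequence 0.
        applied = replica.get(domain, (0, 0))[1]
        lag[domain] = max(0, seq - applied)
    return lag


if __name__ == "__main__":
    print(gtid_lag("1-1-14728", "1-1-14720"))  # {1: 8}
```

A lag of zero in every domain suggests the candidate is safe to promote; a non-zero value means it should catch up first or another replica should be chosen.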

MaxScale for automation: module=mariadbmon monitor with auto_failover=true, auto_rejoin=true; router=readwritesplit service sends writes→primary, reads→replicas; apps connect to MaxScale listener (e.g. :4006) instead of 3306.

TidesDB v9.0.8 / TideSQL v4.2.4 (in MariaDB)

INSTALL SONAME 'ha_tidesdb';
CREATE TABLE events (...) ENGINE=TidesDB;

TidesDB vs RocksDB (NVMe, HammerDB TPC-C, Feb 2026): p50 GET 3 µs vs 4 µs; p99 GET 7 µs vs 12 µs; iteration 1.42× faster; storage 5.6× smaller. Write-heavy TPC-C: TidesDB wins; read-dominant: InnoDB wins. Stable with jemalloc; RocksDB crashes. v8.6: max_memory_usage field in tidesdb_config_t caps total engine footprint.

KV databases: TidesDB v9.0.8 (C, ACID+SSI, 5 isolation levels, column families, Kafka Streams drop-in); BadgerDB 4.7.0 (pure Go, WiscKey LSM, SSI, Dgraph/Jaeger/Pyroscope production).

Data Formats

Format	Use When
Parquet	Analytical queries; 5–20× smaller than JSON; row group 128 MB, ZSTD L3
Iceberg	Parquet + ACID + time travel + partition evolution + O(partitions) pruning
DuckLake	Iceberg-like semantics; catalog = SQL database (a single DuckDB `.db` file is enough); no catalog service to run
Delta Lake	Databricks-native; Delta UniForm for cross-engine reads
Lance	ML training data; O(1) random row access; native vector columns; zero-copy mmap→PyTorch
Arrow IPC/Flight	In-process zero-copy; Flight SQL replaces JDBC (10–50× throughput)

Build & Toolchain

# Rust maximum performance
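# LTO, codegen-units and panic belong in the Cargo profile: -C lto in RUSTFLAGS conflicts with Cargo's own codegen flags
RUSTFLAGS="-C target-cpu=native" CARGO_PROFILE_RELEASE_LTO=fat CARGO_PROFILE_RELEASE_CODEGEN_UNITS=1 CARGO_PROFILE_RELEASE_PANIC=abort cargo build --release   # release already uses opt-level=3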
# LLD is default linker on x86_64 Linux since Rust 1.90 (2–4× faster than GNU ld)

# PGO (Profile-Guided Optimization): +10–20%
# 1. Build with instrumentation  2. Run production-like workload  3. Rebuild with profile
cargo pgo build; cargo pgo run; cargo pgo optimize   # cargo-pgo subcommands automate the workflow

# BOLT post-link (on top of PGO): additional 5–15%
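cargo pgo bolt build --with-pgo      # then run the resulting "-bolt-instrumented" binary on the workload
cargo pgo bolt optimize --with-pgo   # BOLT support ships inside cargo-pgo; there is no separate cargo-bolt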

# GCC/Clang PGO
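-fprofile-generate → run → -fprofile-use   # Clang: llvm-profdata merge -o default.profdata *.profraw, then -fprofile-use=default.profdata
-fprofile-partial-training   # GCC only: do not size-optimize paths the training run never exercised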

# LTO (Link-Time Optimization)
-flto   # GCC (-flto=auto for parallel LTO); Clang: -flto=full or -flto=thin; enables cross-TU inlining

# Java AOT cache (Project Leyden; JDK 24+, one-step creation in JDK 25+, ZGC-compatible in Java 26)
java -XX:AOTCacheOutput=app.aot -jar app.jar           # training run; cache is written at exit
java -XX:AOTCache=app.aot -jar app.jar                 # production

OS Tuning (Linux 7.0 / 6.18 LTS)

Linux 7.0 (Apr 12, 2026; current: 7.0.10 May 23, 2026; 7.1 expected Jul 2026):

Rust stable in kernel (first-class drivers)
io_uring: non-circular queues (better cache), cBPF filters, IOPOLL completion fix
Rebuilt hybrid CPU scheduler: P-cores for latency-sensitive, E-cores for background — automatic, no cgroup config
Open Tree Namespace: faster container creation
XFS self-healing: runtime metadata repair without unmount
Lazy preemption default; Intel TSX enabled on newer chips
7.1: in-kernel NTFS R/W, AMD XDNA v3, ARM SVE2/SME stabilization

Kernel parameters (sysctl):

vm.swappiness=1
vm.dirty_ratio=15; vm.dirty_background_ratio=5
net.core.netdev_max_backlog=65536
net.ipv4.ip_local_port_range="1024 65535"
net.ipv4.tcp_syncookies=1; net.ipv4.tcp_slow_start_after_idle=0
echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
echo defer+madvise > /sys/kernel/mm/transparent_hugepage/defrag
echo 1048576 > /proc/sys/fs/file-max
ulimit -n 1048576

CPU isolation (latency-critical threads):

# GRUB: isolcpus=4-15 nohz_full=4-15 rcu_nocbs=4-15
taskset -c 4-15 ./app                 # pin to isolated cores

I/O scheduler: none (NVMe), mq-deadline (SATA SSD), bfq (mixed/HDD).

sched_ext (Linux 6.12+): BPF-programmable scheduler. Write custom scheduling policies in BPF; loaded at runtime without kernel recompile.

PREEMPT_LAZY (default in Linux 7.0): reduces unnecessary preemption while maintaining responsiveness. RT threads: chrt -f 99 ./app.

Huge pages: echo N > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages. THP: madvise mode (don't use always globally — causes latency spikes on anonymous alloc).

IRQ affinity: pin NIC interrupts to dedicated cores. grep eth0 /proc/interrupts → echo 1 > /proc/irq/N/smp_affinity (hex CPU bitmask: 1 = CPU 0, c = CPUs 2–3). Stop irqbalance first, or it will rewrite the mask.
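
The `smp_affinity` value is a hexadecimal CPU bitmask, which is easy to get wrong by hand. A small sketch of the conversion from a kernel-style CPU list, assuming the usual format of comma-separated 32-bit groups; the helper names are hypothetical.

```python
def parse_cpu_list(spec: str) -> list[int]:
    """Expand a kernel-style CPU list such as '0,4-7' into [0, 4, 5, 6, 7]."""
    cpus = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            lo, hi = (int(x) for x in part.split("-"))
            cpus.update(range(lo, hi + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def smp_affinity_mask(spec: str) -> str:
    """Hex bitmask for /proc/irq/N/smp_affinity: bit i set means CPU i may take the IRQ."""
    mask = 0
    for cpu in parse_cpu_list(spec):
        mask |= 1 << cpu
    digits = f"{mask:x}"
    # Split from the right into 8-hex-digit (32-bit) groups joined by commas.
    groups = []
    while digits:
        groups.append(digits[-8:])
        digits = digits[:-8]
    return ",".join(reversed(groups))


if __name__ == "__main__":
    print(smp_affinity_mask("4-15"))  # fff0
```

Where the kernel exposes `/proc/irq/N/smp_affinity_list`, the list form (`4-15`) can be written directly and no conversion is needed.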

Mitigations cost: mitigations=off recovers 2–5% (compute) to 15–30%+ (syscall-heavy: DBs, proxies) on trusted-code-only hosts. Never on multi-tenant/untrusted-guest machines. Audit: lscpu | grep -A20 Vulnerab.

resctrl (Intel RDT / AMD QoS): partition shared L3 and throttle memory bandwidth per resctrl resource group (a directory under /sys/fs/resctrl, not a cgroup; assign PIDs through its tasks file), which removes noisy-neighbor p99 spikes. mount -t resctrl resctrl /sys/fs/resctrl; write way-masks to schemata (L3:0=ff0), throttle batch to MB:0=30; monitor llc_occupancy/mbm_total_bytes.
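
The way-masks written to `schemata` are contiguous bit ranges, and the latency-critical and batch groups should not overlap. A minimal sketch of building them, assuming a hypothetical 12-way L3 on cache ID 0; the helper names are illustrative only.

```python
def way_mask(first_way: int, n_ways: int) -> int:
    """Contiguous capacity bitmask covering ways [first_way, first_way + n_ways)."""
    if n_ways < 1 or first_way < 0:
        raise ValueError("need at least one way and a non-negative start")
    return ((1 << n_ways) - 1) << first_way


def schemata_line(cache_id: int, first_way: int, n_ways: int, total_ways: int) -> str:
    """Format an L3 schemata entry such as 'L3:0=ff0'."""
    if first_way + n_ways > total_ways:
        raise ValueError("mask exceeds the cache's way count")
    return f"L3:{cache_id}={way_mask(first_way, n_ways):x}"


def masks_overlap(a: int, b: int) -> bool:
    """True when two way-masks share at least one cache way."""
    return (a & b) != 0


if __name__ == "__main__":
    # Assumed split: upper 8 ways for the latency-critical group, lower 4 for batch.
    print(schemata_line(0, 4, 8, 12))  # L3:0=ff0
    print(schemata_line(0, 0, 4, 12))  # L3:0=f
```

The real way count should be read from `/sys/fs/resctrl/info/L3/cbm_mask` rather than assumed.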

MGLRU (6.1+): generational page reclaim — better working-set detection, fewer refaults under pressure. echo y > /sys/kernel/mm/lru_gen/enabled. DAMON: kernel access-frequency monitoring + actions (proactive reclaim, hot-page THP collapse, CXL tier promotion) via damo.

6.x LTS reference: 6.12 (LTS, Dec 2026 EOL): sched_ext stable, PREEMPT_RT mainlined. 6.18 (LTS, Dec 2027 EOL): recommended server LTS. 6.19: last 6.x release.

Third-Party Library Versions (June 2026)

Allocators

mimalloc v3.3.2 (Jan 2026) · jemalloc 5.3.1 · Abseil TCMalloc (google/tcmalloc) · gperftools 2.17

Serialization

rkyv 0.8 (zero-copy Rust) · Cap'n Proto 1.0 · FlatBuffers 24.12 · protobuf 29.x · Apache Arrow 20.0

Hashing / Collections

Google Highway 1.2 · xxHash3 0.8.3 · wyhash 4.2 · Abseil flat_hash_map (2025-01) · RoaringBitmap 1.3 · DashMap 6.1 (Rust)

Concurrency / Networking

Tokio 1.51 LTS (MSRV 1.71; until Mar 2027) · Axum 0.8.x · quic-go 0.49 (UDP GSO/GRO) · Netty 4.2 · LMAX Disruptor 4.0 · Aeron 1.47

Compression

Zstd 1.5.7 (Feb 2025; dict 4.9× faster) · LZ4 1.10.0 · zlib-ng 2.2.4 (2–3× faster than zlib) · Brotli 1.1.0

Databases / Storage

TidesDB v9.0.8 · BadgerDB 4.7.0 · DuckDB 1.3 · Polars 1.24 · SQLite 3.46

Profiling

async-profiler 3.0 (Java) · Intel VTune 2025.0 · bpftrace 0.22 · Parca 0.19 · clinic.js 12.0 (Node.js) · perf (Linux 7.0)

Profiling Workflow

CPU: perf record -g -F 999 -- ./app → perf script | stackcollapse-perf.pl | flamegraph.pl > flame.svg. Widest towers = hottest.
Cache: perf stat -e cache-references,cache-misses,L1-dcache-load-misses,LLC-loads,LLC-load-misses ./app. An LLC load-miss ratio (LLC-load-misses / LLC-loads) above ~1% is significant.
Memory: heaptrack (C++), async-profiler -e alloc (Java), pprof heap (Go), memray (Python).
Locks: async-profiler -e lock (Java), pprof mutex (Go), perf lock (Linux).
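I/O: iostat -x 1 (watch await, %util). Latency histogram: bpftrace -e 'tracepoint:block:block_rq_issue { @start[args->dev, args->sector] = nsecs; } tracepoint:block:block_rq_complete /@start[args->dev, args->sector]/ { @lat_us = hist((nsecs - @start[args->dev, args->sector]) / 1000); delete(@start[args->dev, args->sector]); }'.
Network: ss -s, netstat -s | grep -i retrans, ethtool -S eth0 | grep -i err.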

Benchmark discipline: 30+ runs, p50/p95/p99/p999, never mean-only. Coordinated omission: closed-loop generators (wrk, ab) under-report p99 by 10–1000× during server stalls — use open-loop wrk2 -R, Gatling constant-rate, or k6 constant-arrival-rate + HdrHistogram corrected recording. Environment: governor=performance, turbo off, ASLR off, taskset+numactl pinned. Warm JVM ≥ 100K iterations. Flame graph before any optimization.
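
The under-reporting caused by coordinated omission can be reproduced with a few lines of arithmetic. The sketch below back-fills the samples a blocked closed-loop client never sent, following the same idea as HdrHistogram's `recordValueWithExpectedInterval`; the workload numbers are hypothetical and the function names are illustrative.

```python
import math


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile (p in 0..100) of a non-empty sample list."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def correct_for_omission(latencies: list[float], expected_interval: float) -> list[float]:
    """Add the samples a closed-loop client skipped while it was blocked.

    A response that took v hides requests that should have been issued every
    expected_interval and would have waited v - interval, v - 2*interval, ...
    """
    corrected = []
    for v in latencies:
        corrected.append(v)
        missing = v - expected_interval
        while missing >= expected_interval:
            corrected.append(missing)
            missing -= expected_interval
    return corrected


if __name__ == "__main__":
    # Hypothetical run: 999 requests at 1 ms plus one 1000 ms server stall,
    # from a client that intended to send one request every 10 ms.
    raw = [1.0] * 999 + [1000.0]
    fixed = correct_for_omission(raw, expected_interval=10.0)
    print(percentile(raw, 99), percentile(fixed, 99))  # 1.0 900.0
```

In this example the uncorrected p99 is 1 ms while the corrected p99 is 900 ms, which is why an open-loop generator or corrected recording is needed before trusting tail numbers.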

Security Without Overhead

TLS 1.3 only (ssl_protocols TLSv1.3); session tickets for resumption; OCSP stapling.
AEAD ciphers: AES-256-GCM where AES-NI/VAES hardware is present (negligible CPU overhead); ChaCha20-Poly1305 for clients without AES acceleration.
Memory-safe defaults: bounds-checked slices, no raw pointer arithmetic without explicit unsafe.
scram-sha-256 for PostgreSQL auth; MariaDB 12.3: caching_sha2_password.
Input validation at boundaries: check length/type/range at ingestion; don't repeat per function.
O_CLOEXEC on file descriptors; close extra fds before exec().
Separate processes/namespaces for secret handling; don't log secrets or tokens.

SIMD Search & Sort Algorithms

SIMD Binary Search — Hybrid Eytzinger + SIMD Terminal Scan

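Standard binary search: cache-hostile. On a 4 MB array the first ~8 probes land on widely separated cache lines, so each one misses L1/L2 and pays L3 latency (~40 ns), about 320 ns before the search narrows to an already-cached region.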

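Eytzinger layout: store the sorted keys in BFS (level) order of an implicit binary tree, 1-indexed, so node k has children 2k and 2k+1 (k descends 1→2→4→...); build it with an in-order walk that assigns dst[k] = src[i++]. The top 4 levels (15 int32 keys) fit in one 64-byte cache line. Search: k = 2*k + (a[k] < x); __builtin_prefetch(a + k*16, 0, 0); after the loop, k >>= __builtin_ffs(~k) recovers the lower_bound slot. ~40% faster than binary search for large N.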
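
The layout and the branch-free descent can be checked at small scale. The following is a sketch in Python for readability, so the prefetch is omitted and the speedup itself is not demonstrated; the function names are illustrative.

```python
def eytzinger(sorted_keys: list[int]) -> list[int]:
    """Return a 1-indexed BFS-order array; slot 0 is an unused placeholder."""
    n = len(sorted_keys)
    out = [0] * (n + 1)
    it = iter(sorted_keys)

    def fill(k: int) -> None:
        # In-order walk of the implicit tree: left child 2k, node k, right child 2k+1.
        if k <= n:
            fill(2 * k)
            out[k] = next(it)
            fill(2 * k + 1)

    fill(1)
    return out


def lower_bound_slot(tree: list[int], x: int) -> int:
    """Slot of the first key >= x in the Eytzinger array, or 0 if every key is < x."""
    n = len(tree) - 1
    k = 1
    while k <= n:
        k = 2 * k + (tree[k] < x)  # go right (2k+1) when the key is too small
    # Cancel the trailing right turns and the last left turn:
    # equivalent to k >>= __builtin_ffs(~k) in C.
    trailing_ones = (~k & (k + 1)).bit_length() - 1
    return k >> (trailing_ones + 1)


if __name__ == "__main__":
    tree = eytzinger([1, 3, 5, 7, 9, 11, 13])
    print(tree)                             # [0, 7, 3, 11, 1, 5, 9, 13]
    print(tree[lower_bound_slot(tree, 6)])  # 7
```

The root sits in slot 1 and each level is contiguous, which is what lets the hot top of the tree stay in a handful of cache lines.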

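SIMD terminal scan (for final 32 elements): load 4×__m256i; _mm256_cmpgt_epi32(x, v) (signed compare; a lane is all-ones where v < x) + _mm256_movemask_epi8; __builtin_popcount(mask)/4, summed over the four vectors, = lower_bound index. 4 SIMD loads + 4 compares + popcount = ~2–4 cycles of throughput once the block is in L1, with no branches.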

Hybrid: Eytzinger descent (N → ~32) + SIMD linear scan = ~65% faster than std::lower_bound for N ∈ [512, 10K].

AVX-512 terminal scan: __mmask16 m = _mm512_cmplt_epi32_mask(v, vt); int pos = __builtin_popcount(m) — 16 elements per instruction, zero branches.
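
The compare-mask-popcount idiom works because the block is sorted: the number of lanes holding a value below x is exactly the lower_bound index. A scalar sketch of the same logic, with each loop iteration standing in for one SIMD lane; the function name is illustrative.

```python
def lower_bound_by_count(block: list[int], x: int) -> int:
    """Index of the first element >= x in a sorted block, found by counting lanes < x."""
    mask = 0
    for lane, value in enumerate(block):
        mask |= (value < x) << lane   # one result bit per lane, like a __mmask16
    return bin(mask).count("1")       # popcount of the compare mask


if __name__ == "__main__":
    print(lower_bound_by_count([1, 3, 5, 7], 5))  # 2
```

No iteration depends on the outcome of a previous comparison, which is the property that lets the hardware version run as a single compare instruction.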

SIMD Sort Algorithms

Algorithm	Library	When	Gain
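pdqsort (pattern-defeating quicksort)	Rust `slice::sort_unstable` (pdqsort-derived ipnsort since 1.81), Go `slices.Sort`, C++ `boost::sort::pdqsort` (libstdc++ `std::sort` is introsort)	Default for any comparison sort	In stdlib for Rust/Go; branchless 64-element block partition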
`vqsort` (Google Highway)	`hwy::VQSort(arr, n, hwy::SortAscending())`	N > 1K; int32/int64/f32/f64	3–8× vs `std::sort`; AVX-512 compress/expand for partitioning
SIMD radix sort (LSD, 8-bit digits)	`ska_sort` (C++, in-place MSD variant), `ips2ra` (parallel radix; `ips4o` itself is a samplesort), `rdxsort` (Rust)	Integer/float keys N > 100K	O(N) per digit pass; SIMD histogram build
Sort networks (fixed N)	Manual intrinsics	N = 4/8/16/32 (e.g., sorting SIMD registers, tuple keys)	8 f32 in ~6 cycles (vs ~40 cycles scalar insertion sort)
Cache-aware blocked merge	Block = (L1 size / 2) ÷ element size, in elements	Large arrays exceeding L2	Bottom-up; merge on L1-resident subarrays
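
The LSD radix sort row can be made concrete with a scalar version of the histogram, prefix-sum, and scatter passes. This is a sketch assuming unsigned 32-bit keys; a SIMD implementation vectorizes the histogram build, which is not shown here, and the function name is illustrative.

```python
def radix_sort_u32(keys: list[int]) -> list[int]:
    """LSD radix sort of unsigned 32-bit integers using four 8-bit digit passes."""
    src = list(keys)
    for shift in (0, 8, 16, 24):
        # Step 1: histogram of the current 8-bit digit.
        counts = [0] * 256
        for k in src:
            counts[(k >> shift) & 0xFF] += 1
        # Step 2: exclusive prefix sum turns counts into bucket start offsets.
        offsets = [0] * 256
        total = 0
        for d in range(256):
            offsets[d] = total
            total += counts[d]
        # Step 3: stable scatter into the destination buffer.
        dst = [0] * len(src)
        for k in src:
            d = (k >> shift) & 0xFF
            dst[offsets[d]] = k
            offsets[d] += 1
        src = dst
    return src


if __name__ == "__main__":
    print(radix_sort_u32([300, 2, 70000, 2, 1]))  # [1, 2, 2, 300, 70000]
```

Each pass is stable, so sorting from the least significant digit upward leaves the keys fully ordered after the fourth pass.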

NUMA & CXL Memory

NUMA allocation: numactl --membind=N --cpunodebind=N ./app. Linux CXL memory appears as additional NUMA node.
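mbind(addr, len, MPOL_BIND, &cxl_nodemask, ...) for CXL tier placement. MPOL_PREFERRED_MANY for DDR5-first, CXL fallback.
MemTis / TPP (Transparent Page Placement): auto-migrate pages DDR5↔CXL by access frequency. TPP is upstream since 5.18 (echo 1 > /sys/kernel/mm/numa/demotion_enabled; sysctl kernel.numa_balancing=2); MemTis is a research prototype.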
CXL latency: ~120–200 ns (vs ~50–80 ns DDR5). Use for warm/cold data; keep hot data in local DDR5.
CXL 3.0: shared memory pool across multiple hosts via CXL switch — relevant for distributed caches.
Tools: lscpu | grep NUMA, numastat, numactl --hardware, ls /sys/bus/cxl/devices/.

corporatepiyush/AgentCompact.md