TurboQuant + llama-swap: 65K context on a single RTX 4080

Drop-in Dockerfile that builds TurboQuant llama-server and swaps it into llama-swap. One image, same config format, just add -ctk turbo3 -ctv turbo3.

Benchmarks: Qwen3.5-9B Q8_0 on RTX 4080 (16GB)

Metric	Stock llama.cpp	TurboQuant (turbo3)
Prefill	48.9 tok/s	4,176 tok/s
Decode	44.5 tok/s	18.2 tok/s
Context window	8K (VRAM-limited)	65K
KV cache compression	1x (FP16)	~5x (3.5 bpv)
VRAM free (model loaded)	820 MB	1,314 MB

Needle-in-a-haystack: model correctly retrieves a single sentence buried in 21K tokens of noise.

Tested by pasting an entire Project Gutenberg book and asking questions about it. It works.

Why the prefill jump?

The stock numbers (48.9 tok/s) were from an earlier run on standard llama.cpp with FP16 KV at 8K context. The TurboQuant build also picks up the latest Flash Attention kernels from spiritbuun's fork, which combined with turbo3 KV quantization gives the massive prefill improvement. Decode is slower due to the dequant step — this is a known tradeoff that's being optimized upstream.

Dockerfile

ARG CUDA_VERSION=12.4.0
ARG UBUNTU_VERSION=22.04

# === Stage 1: Build TurboQuant llama-server from source ===
FROM nvidia/cuda:${CUDA_VERSION}-devel-ubuntu${UBUNTU_VERSION} AS build

RUN apt-get update && \
    apt-get install -y --no-install-recommends \
      build-essential cmake git libssl-dev libgomp1 libcurl4-openssl-dev && \
    rm -rf /var/lib/apt/lists/*

ARG CUDA_ARCHITECTURES=89
ARG TURBOQUANT_REPO=https://github.com/spiritbuun/llama-cpp-turboquant-cuda.git
ARG TURBOQUANT_BRANCH=feature/turboquant-kv-cache

WORKDIR /src
RUN git clone --depth 1 -b ${TURBOQUANT_BRANCH} ${TURBOQUANT_REPO} . && \
    cmake -B build \
      -DGGML_NATIVE=OFF \
      -DGGML_CUDA=ON \
      -DGGML_CUDA_FA=ON \
      -DGGML_CUDA_FA_ALL_QUANTS=ON \
      -DCMAKE_CUDA_ARCHITECTURES=${CUDA_ARCHITECTURES} \
      -DLLAMA_BUILD_SERVER=ON \
      -DLLAMA_BUILD_TESTS=OFF \
      -DCMAKE_BUILD_TYPE=Release \
      -DCMAKE_EXE_LINKER_FLAGS="-L/usr/local/cuda/lib64/stubs -lcuda" \
      -DCMAKE_SHARED_LINKER_FLAGS="-L/usr/local/cuda/lib64/stubs -lcuda" && \
    cmake --build build --config Release -j$(nproc)

RUN mkdir -p /artifacts/lib && \
    cp build/bin/llama-server /artifacts/ && \
    find build -name "*.so*" -exec cp -P {} /artifacts/lib/ \;

# === Stage 2: Drop into llama-swap image ===
FROM ghcr.io/mostlygeek/llama-swap:cuda

COPY --from=build /artifacts/llama-server /app/llama-server
COPY --from=build /artifacts/lib/ /app/

RUN llama-server --version || true

llama-swap config

models:
  qwen3.5-9b:
    ttl: 600
    cmd: |
      llama-server
      -hf mradermacher/Huihui-Qwen3.5-9B-abliterated-GGUF:Q8_0
      --port ${PORT} --host 0.0.0.0
      -ngl 99 -fa on
      -ctk turbo3 -ctv turbo3
      -c 65536 -t 16

docker-compose.yml

services:
  llama-swap:
    build:
      context: .
      args:
        CUDA_ARCHITECTURES: "89"  # 4080/4090=89, 3090/3080=86, 2080=75
    volumes:
      - ./config.yaml:/app/config.yaml:ro
      - llama_cpp_cache:/root/.cache/llama.cpp
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

volumes:
  llama_cpp_cache:

Gotchas we hit building this

Missing FA flags: spiritbuun's stock Dockerfiles don't pass -DGGML_CUDA_FA=ON -DGGML_CUDA_FA_ALL_QUANTS=ON. Without these, turbo cache types compile but Flash Attention kernels don't — and turbo requires FA at runtime (KV is stored in FWHT-rotated space).
CUDA stubs linker error: Build fails with undefined reference to cuMemCreate etc. Fix: -DCMAKE_EXE_LINKER_FLAGS="-L/usr/local/cuda/lib64/stubs -lcuda" (and same for CMAKE_SHARED_LINKER_FLAGS).
Binary path: llama-swap base image has /app on PATH. If you COPY to /usr/bin/, the stock /app/llama-server shadows it. Copy to /app/ to replace.
Branch: Must use feature/turboquant-kv-cache, not master. Master is just upstream llama.cpp with no turbo code.

Cache types

Flag	Bits/value	Compression	Notes
`turbo3`	3.5 bpv	~5x	Recommended default
`turbo4`	4.25 bpv	~3.8x	Higher quality
`turbo2`	2.5 bpv	~6.4x	Max compression

Where TurboQuant matters most

TurboQuant compresses KV cache, not model weights. Impact depends on architecture:

Standard attention (Gemma 3, Llama 3, Qwen3-14B): Transformative — 4-5x more context
Hybrid DeltaNet (Qwen3.5-9B, Qwen3.5-35B-A3B): Nice-to-have — DeltaNet already minimizes KV
Pure SSM (RWKV, Mamba): No KV cache, irrelevant

Status

TurboQuant is pre-merge in llama.cpp (PR #21089, PR #21038). This Dockerfile becomes unnecessary once it lands in mainline. Until then, this is the easiest way to run it with llama-swap.

konradish/gist-turboquant.md

Select an option

No results found