Skip to content

Instantly share code, notes, and snippets.

@konradish
Created March 28, 2026 20:44
Show Gist options
  • Select an option

  • Save konradish/35b7f1a7e97f0b9aeb866226c5880d34 to your computer and use it in GitHub Desktop.

Select an option

Save konradish/35b7f1a7e97f0b9aeb866226c5880d34 to your computer and use it in GitHub Desktop.
TurboQuant + llama-swap: 65K context on a single RTX 4080. One Dockerfile.

TurboQuant + llama-swap: 65K context on a single RTX 4080

Drop-in Dockerfile that builds TurboQuant llama-server and swaps it into llama-swap. One image, same config format, just add -ctk turbo3 -ctv turbo3.

Benchmarks: Qwen3.5-9B Q8_0 on RTX 4080 (16GB)

Metric Stock llama.cpp TurboQuant (turbo3)
Prefill 48.9 tok/s 4,176 tok/s
Decode 44.5 tok/s 18.2 tok/s
Context window 8K (VRAM-limited) 65K
KV cache compression 1x (FP16) ~5x (3.5 bpv)
VRAM free (model loaded) 820 MB 1,314 MB

Needle-in-a-haystack: model correctly retrieves a single sentence buried in 21K tokens of noise.

Tested by pasting an entire Project Gutenberg book and asking questions about it. It works.

Why the prefill jump?

The stock numbers (48.9 tok/s) were from an earlier run on standard llama.cpp with FP16 KV at 8K context. The TurboQuant build also picks up the latest Flash Attention kernels from spiritbuun's fork, which combined with turbo3 KV quantization gives the massive prefill improvement. Decode is slower due to the dequant step — this is a known tradeoff that's being optimized upstream.

Dockerfile

ARG CUDA_VERSION=12.4.0
ARG UBUNTU_VERSION=22.04

# === Stage 1: Build TurboQuant llama-server from source ===
FROM nvidia/cuda:${CUDA_VERSION}-devel-ubuntu${UBUNTU_VERSION} AS build

RUN apt-get update && \
    apt-get install -y --no-install-recommends \
      build-essential cmake git libssl-dev libgomp1 libcurl4-openssl-dev && \
    rm -rf /var/lib/apt/lists/*

ARG CUDA_ARCHITECTURES=89
ARG TURBOQUANT_REPO=https://github.com/spiritbuun/llama-cpp-turboquant-cuda.git
ARG TURBOQUANT_BRANCH=feature/turboquant-kv-cache

WORKDIR /src
RUN git clone --depth 1 -b ${TURBOQUANT_BRANCH} ${TURBOQUANT_REPO} . && \
    cmake -B build \
      -DGGML_NATIVE=OFF \
      -DGGML_CUDA=ON \
      -DGGML_CUDA_FA=ON \
      -DGGML_CUDA_FA_ALL_QUANTS=ON \
      -DCMAKE_CUDA_ARCHITECTURES=${CUDA_ARCHITECTURES} \
      -DLLAMA_BUILD_SERVER=ON \
      -DLLAMA_BUILD_TESTS=OFF \
      -DCMAKE_BUILD_TYPE=Release \
      -DCMAKE_EXE_LINKER_FLAGS="-L/usr/local/cuda/lib64/stubs -lcuda" \
      -DCMAKE_SHARED_LINKER_FLAGS="-L/usr/local/cuda/lib64/stubs -lcuda" && \
    cmake --build build --config Release -j$(nproc)

RUN mkdir -p /artifacts/lib && \
    cp build/bin/llama-server /artifacts/ && \
    find build -name "*.so*" -exec cp -P {} /artifacts/lib/ \;

# === Stage 2: Drop into llama-swap image ===
FROM ghcr.io/mostlygeek/llama-swap:cuda

COPY --from=build /artifacts/llama-server /app/llama-server
COPY --from=build /artifacts/lib/ /app/

RUN llama-server --version || true

llama-swap config

models:
  qwen3.5-9b:
    ttl: 600
    cmd: |
      llama-server
      -hf mradermacher/Huihui-Qwen3.5-9B-abliterated-GGUF:Q8_0
      --port ${PORT} --host 0.0.0.0
      -ngl 99 -fa on
      -ctk turbo3 -ctv turbo3
      -c 65536 -t 16

docker-compose.yml

services:
  llama-swap:
    build:
      context: .
      args:
        CUDA_ARCHITECTURES: "89"  # 4080/4090=89, 3090/3080=86, 2080=75
    volumes:
      - ./config.yaml:/app/config.yaml:ro
      - llama_cpp_cache:/root/.cache/llama.cpp
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

volumes:
  llama_cpp_cache:

Gotchas we hit building this

  1. Missing FA flags: spiritbuun's stock Dockerfiles don't pass -DGGML_CUDA_FA=ON -DGGML_CUDA_FA_ALL_QUANTS=ON. Without these, turbo cache types compile but Flash Attention kernels don't — and turbo requires FA at runtime (KV is stored in FWHT-rotated space).

  2. CUDA stubs linker error: Build fails with undefined reference to cuMemCreate etc. Fix: -DCMAKE_EXE_LINKER_FLAGS="-L/usr/local/cuda/lib64/stubs -lcuda" (and same for CMAKE_SHARED_LINKER_FLAGS).

  3. Binary path: llama-swap base image has /app on PATH. If you COPY to /usr/bin/, the stock /app/llama-server shadows it. Copy to /app/ to replace.

  4. Branch: Must use feature/turboquant-kv-cache, not master. Master is just upstream llama.cpp with no turbo code.

Cache types

Flag Bits/value Compression Notes
turbo3 3.5 bpv ~5x Recommended default
turbo4 4.25 bpv ~3.8x Higher quality
turbo2 2.5 bpv ~6.4x Max compression

Where TurboQuant matters most

TurboQuant compresses KV cache, not model weights. Impact depends on architecture:

  • Standard attention (Gemma 3, Llama 3, Qwen3-14B): Transformative — 4-5x more context
  • Hybrid DeltaNet (Qwen3.5-9B, Qwen3.5-35B-A3B): Nice-to-have — DeltaNet already minimizes KV
  • Pure SSM (RWKV, Mamba): No KV cache, irrelevant

Status

TurboQuant is pre-merge in llama.cpp (PR #21089, PR #21038). This Dockerfile becomes unnecessary once it lands in mainline. Until then, this is the easiest way to run it with llama-swap.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment