Drop-in Dockerfile that builds TurboQuant llama-server and swaps it into llama-swap. One image, same config format, just add -ctk turbo3 -ctv turbo3.
| Metric | Stock llama.cpp | TurboQuant (turbo3) |
|---|---|---|
| Prefill | 48.9 tok/s | 4,176 tok/s |
| Decode | 44.5 tok/s | 18.2 tok/s |
| Context window | 8K (VRAM-limited) | 65K |
| KV cache compression | 1x (FP16) | ~5x (3.5 bpv) |
| VRAM free (model loaded) | 820 MB | 1,314 MB |
Needle-in-a-haystack: model correctly retrieves a single sentence buried in 21K tokens of noise.
Tested by pasting an entire Project Gutenberg book and asking questions about it. It works.
The stock numbers (48.9 tok/s) were from an earlier run on standard llama.cpp with FP16 KV at 8K context. The TurboQuant build also picks up the latest Flash Attention kernels from spiritbuun's fork, which combined with turbo3 KV quantization gives the massive prefill improvement. Decode is slower due to the dequant step — this is a known tradeoff that's being optimized upstream.
ARG CUDA_VERSION=12.4.0
ARG UBUNTU_VERSION=22.04
# === Stage 1: Build TurboQuant llama-server from source ===
FROM nvidia/cuda:${CUDA_VERSION}-devel-ubuntu${UBUNTU_VERSION} AS build
RUN apt-get update && \
apt-get install -y --no-install-recommends \
build-essential cmake git libssl-dev libgomp1 libcurl4-openssl-dev && \
rm -rf /var/lib/apt/lists/*
ARG CUDA_ARCHITECTURES=89
ARG TURBOQUANT_REPO=https://github.com/spiritbuun/llama-cpp-turboquant-cuda.git
ARG TURBOQUANT_BRANCH=feature/turboquant-kv-cache
WORKDIR /src
RUN git clone --depth 1 -b ${TURBOQUANT_BRANCH} ${TURBOQUANT_REPO} . && \
cmake -B build \
-DGGML_NATIVE=OFF \
-DGGML_CUDA=ON \
-DGGML_CUDA_FA=ON \
-DGGML_CUDA_FA_ALL_QUANTS=ON \
-DCMAKE_CUDA_ARCHITECTURES=${CUDA_ARCHITECTURES} \
-DLLAMA_BUILD_SERVER=ON \
-DLLAMA_BUILD_TESTS=OFF \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_EXE_LINKER_FLAGS="-L/usr/local/cuda/lib64/stubs -lcuda" \
-DCMAKE_SHARED_LINKER_FLAGS="-L/usr/local/cuda/lib64/stubs -lcuda" && \
cmake --build build --config Release -j$(nproc)
RUN mkdir -p /artifacts/lib && \
cp build/bin/llama-server /artifacts/ && \
find build -name "*.so*" -exec cp -P {} /artifacts/lib/ \;
# === Stage 2: Drop into llama-swap image ===
FROM ghcr.io/mostlygeek/llama-swap:cuda
COPY --from=build /artifacts/llama-server /app/llama-server
COPY --from=build /artifacts/lib/ /app/
RUN llama-server --version || truemodels:
qwen3.5-9b:
ttl: 600
cmd: |
llama-server
-hf mradermacher/Huihui-Qwen3.5-9B-abliterated-GGUF:Q8_0
--port ${PORT} --host 0.0.0.0
-ngl 99 -fa on
-ctk turbo3 -ctv turbo3
-c 65536 -t 16services:
llama-swap:
build:
context: .
args:
CUDA_ARCHITECTURES: "89" # 4080/4090=89, 3090/3080=86, 2080=75
volumes:
- ./config.yaml:/app/config.yaml:ro
- llama_cpp_cache:/root/.cache/llama.cpp
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
volumes:
llama_cpp_cache:-
Missing FA flags: spiritbuun's stock Dockerfiles don't pass
-DGGML_CUDA_FA=ON -DGGML_CUDA_FA_ALL_QUANTS=ON. Without these, turbo cache types compile but Flash Attention kernels don't — and turbo requires FA at runtime (KV is stored in FWHT-rotated space). -
CUDA stubs linker error: Build fails with
undefined reference to cuMemCreateetc. Fix:-DCMAKE_EXE_LINKER_FLAGS="-L/usr/local/cuda/lib64/stubs -lcuda"(and same forCMAKE_SHARED_LINKER_FLAGS). -
Binary path: llama-swap base image has
/appon PATH. If you COPY to/usr/bin/, the stock/app/llama-servershadows it. Copy to/app/to replace. -
Branch: Must use
feature/turboquant-kv-cache, notmaster. Master is just upstream llama.cpp with no turbo code.
| Flag | Bits/value | Compression | Notes |
|---|---|---|---|
turbo3 |
3.5 bpv | ~5x | Recommended default |
turbo4 |
4.25 bpv | ~3.8x | Higher quality |
turbo2 |
2.5 bpv | ~6.4x | Max compression |
TurboQuant compresses KV cache, not model weights. Impact depends on architecture:
- Standard attention (Gemma 3, Llama 3, Qwen3-14B): Transformative — 4-5x more context
- Hybrid DeltaNet (Qwen3.5-9B, Qwen3.5-35B-A3B): Nice-to-have — DeltaNet already minimizes KV
- Pure SSM (RWKV, Mamba): No KV cache, irrelevant
TurboQuant is pre-merge in llama.cpp (PR #21089, PR #21038). This Dockerfile becomes unnecessary once it lands in mainline. Until then, this is the easiest way to run it with llama-swap.