Install llama.cpp: go to https://github.com/ggml-org/llama.cpp and follow the build instructions for your platform. Alternatively, you can use another inference engine of your choice.
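On Linux with an NVIDIA GPU, the build usually comes down to something like the sketch below; the CUDA flag and steps are assumptions, so check the upstream README for your platform and backend:

  git clone https://github.com/ggml-org/llama.cpp
  cd llama.cpp
  cmake -B build -DGGML_CUDA=ON        # CUDA backend; omit or replace for CPU/Metal/Vulkan builds
  cmake --build build --config Release -j

This produces build/bin/llama-server, which is the binary referenced in the llama-swap config further down.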
Install llama-swap: go to https://github.com/mostlygeek/llama-swap and follow the instructions there.
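Once installed, llama-swap is pointed at a YAML config file and a listen address, roughly as below. The flag names are as I recall them from the llama-swap README, so verify them against the upstream documentation:

  # Start the llama-swap proxy with the config.yaml shown further down.
  ./llama-swap --config config.yaml --listen 127.0.0.1:8080

llama-swap then starts and stops llama-server instances on demand, based on the model name in incoming requests.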
The best model that fits in 12 GB of VRAM to date is https://huggingface.co/prithivMLmods/Ophiuchi-Qwen3-14B-Instruct. Choose a quantization that fits in VRAM and still leaves enough room for the context; Nanobrowser uses a lot of tokens (>10K).
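The config below uses the Q4_K_S imatrix quantization. Purely as an illustration, a download could look like the following; the repository id is a guess inferred from the GGUF file name (the "i1" prefix is the usual imatrix naming convention) and must be verified on Hugging Face before use:

  # Hypothetical repo id -- confirm the actual GGUF repository first.
  huggingface-cli download mradermacher/Ophiuchi-Qwen3-14B-Instruct-i1-GGUF \
    Ophiuchi-Qwen3-14B-Instruct.i1-Q4_K_S.gguf --local-dir .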
"qwen3":
ttl: 300
cmd: /home/ai/llama.cpp/build/bin/llama-server --port 9027
--flash-attn --metrics
-ctk q8_0
-ctv q8_0
--slots
--model Ophiuchi-Qwen3-14B-Instruct.i1-Q4_K_S.gguf
-ngl 49
--ctx-size 12000
--top_k 1
--cache-reuse 256
--jinja
proxy: http://127.0.0.1:9027
Recommended sampling settings:
- temperature: 0.7
- top_p: 0.1
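With llama-swap running, the whole chain can be sanity-checked with an OpenAI-style chat request against the proxy; llama-swap starts the matching llama-server on demand. The host and port below assume the --listen address used when starting llama-swap:

  curl http://127.0.0.1:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "qwen3", "messages": [{"role": "user", "content": "Say hello"}]}'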