Install llama.cpp: go to https://github.com/ggml-org/llama.cpp and follow the build instructions for your platform. Alternatively, you can use another inference engine of your choice.
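On Linux with an NVIDIA GPU, the build usually comes down to something like the sketch below; the CUDA flag and steps are assumptions, so check the upstream README for your platform and backend:

  git clone https://github.com/ggml-org/llama.cpp
  cd llama.cpp
  cmake -B build -DGGML_CUDA=ON        # CUDA backend; omit or replace for CPU/Metal/Vulkan builds
  cmake --build build --config Release -j

This produces build/bin/llama-server, which is the binary referenced in the llama-swap config further down.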
Install llama-swap: go to https://github.com/mostlygeek/llama-swap and follow the instructions there.
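Once installed, llama-swap is pointed at a YAML config file and a listen address, roughly as below. The flag names are as I recall them from the llama-swap README, so verify them against the upstream documentation:

  # Start the llama-swap proxy with the config.yaml shown further down.
  ./llama-swap --config config.yaml --listen 127.0.0.1:8080

llama-swap then starts and stops llama-server instances on demand, based on the model name in incoming requests.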
The best model that fits in 12 GB of VRAM to date is https://huggingface.co/prithivMLmods/Ophiuchi-Qwen3-14B-Instruct. Choose a quantization that fits in VRAM and still leaves enough room for the context; Nanobrowser uses a lot of tokens (>10K).
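The config below uses the Q4_K_S imatrix quantization. Purely as an illustration, a download could look like the following; the repository id is a guess inferred from the GGUF file name (the "i1" prefix is the usual imatrix naming convention) and must be verified on Hugging Face before use:

  # Hypothetical repo id -- confirm the actual GGUF repository first.
  huggingface-cli download mradermacher/Ophiuchi-Qwen3-14B-Instruct-i1-GGUF \
    Ophiuchi-Qwen3-14B-Instruct.i1-Q4_K_S.gguf --local-dir .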
"qwen3":
ttl: 300
cmd: /home/ai/llama.cpp/build/bin/llama-server --port 9027
--flash-attn --metrics
-ctk q8_0
-ctv q8_0
--slots
--model Ophiuchi-Qwen3-14B-Instruct.i1-Q4_K_S.gguf
-ngl 49
--ctx-size 12000
--top_k 1
--cache-reuse 256
--jinja
proxy: http://127.0.0.1:9027
Recommended sampling settings:
- temperature: 0.7
- top_p: 0.1
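With llama-swap running, the whole chain can be sanity-checked with an OpenAI-style chat request against the proxy; llama-swap starts the matching llama-server on demand. The host and port below assume the --listen address used when starting llama-swap:

  curl http://127.0.0.1:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "qwen3", "messages": [{"role": "user", "content": "Say hello"}]}'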