using nanobrowser with a local llm

install inference engine

install llama.cpp: go to https://github.com/ggml-org/llama.cpp and follow the instructions for your platform. Alternatively, you can use another inference engine of your choice.
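As a minimal sketch, a CUDA build on Linux looks like this (backend flags differ per platform, so check the llama.cpp build docs). The resulting llama-server binary in build/bin is what the llama-swap config below points at:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release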

install OpenAI-compatible layer

install llama-swap: go to https://github.com/mostlygeek/llama-swap and follow the instructions.
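llama-swap runs as a single binary that reads a YAML config (shown under configuration below) and exposes an OpenAI-compatible proxy. A typical invocation looks like the following; the flag names match its README, but verify them against your installed version, and substitute your own config path:

llama-swap --config /path/to/config.yaml --listen :8080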

download model

The best model that fits in 12 GB of VRAM to date is https://huggingface.co/prithivMLmods/Ophiuchi-Qwen3-14B-Instruct. Choose a quantization that fits in VRAM and still leaves enough room for the context: Nanobrowser uses a lot of tokens (>10K).
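One way to fetch a GGUF file is with huggingface-cli. The repo name below is an assumption (the i1-Q4_K_S file referenced in the config later looks like a third-party imatrix quant), so substitute whichever GGUF repo you actually pick:

# hypothetical quant repo name; replace with the GGUF repo you chose
huggingface-cli download mradermacher/Ophiuchi-Qwen3-14B-Instruct-i1-GGUF \
  Ophiuchi-Qwen3-14B-Instruct.i1-Q4_K_S.gguf --local-dir .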

configuration

llama-swap

"qwen3":
    ttl: 300
    cmd: /home/ai/llama.cpp/build/bin/llama-server --port 9027
      --flash-attn --metrics
      -ctk q8_0
      -ctv q8_0
      --slots
      --model Ophiuchi-Qwen3-14B-Instruct.i1-Q4_K_S.gguf
      -ngl 49
      --ctx-size 12000
      --top_k 1
      --cache-reuse 256
      --jinja
    proxy: http://127.0.0.1:9027
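With llama-swap running (assumed here to listen on :8080), listing the models is a quick check that it picked up the config; the model itself is loaded on the first request and unloaded again after the 300-second ttl:

curl http://localhost:8080/v1/models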

nanobrowser

  • temperature: 0.7
  • top_p: 0.1
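These are the sampling values to enter for the model in Nanobrowser's provider settings, with the provider pointed at the llama-swap endpoint (assuming the setup above: base URL http://localhost:8080/v1, model name qwen3). The equivalent raw OpenAI-style request looks like this:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwen3",
        "temperature": 0.7,
        "top_p": 0.1,
        "messages": [{"role": "user", "content": "hello"}]
      }'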