Documentation of the calibration process and llama.cpp configuration with Gemma 4 E2B (Q3_K_M) on an NVIDIA Jetson Orin Nano (8GB).
| Component | Specification |
|---|---|
| Device | NVIDIA Jetson Orin Nano 8GB |
| CPU | ARM Cortex-A78AE, 6 cores |
| GPU | NVIDIA Ampere, 1024 CUDA cores, compute capability 8.7 |
| RAM | 7.4 GiB (unified, shared CPU/GPU) |
| JetPack | 6.2 (L4T r36.4, CUDA 12.6) |
| Storage | NVMe SSD |
I have manually downloaded the models optimized by unsloth
| Property | Value |
|---|---|
| Model | google/gemma-4-E2B-it (Effective 2B) |
| Total params | 4.65B |
| File | gemma-4-E2B-it-Q3_K_M.gguf (2.35 GiB) |
| Quantization | Q3_K_M (4.34 BPW) |
| Layers | 35 |
| Max context | 131072 tokens |
| License | Apache 2.0 |
Assuming you had installed Ollama like me at first.
sudo systemctl stop ollama
echo 3 | sudo tee /proc/sys/vm/drop_cachesfor cpu in /sys/devices/system/cpu/cpu[0-5]/cpufreq/scaling_governor; do
echo performance | sudo tee $cpu
doneYeah, you need docker at this point
docker pull ghcr.io/nvidia-ai-iot/llama_cpp:latest-jetson-orinContext sizes tested with -fit off to force allocation:
| ctx-size | KV cache (GPU) | GPU VRAM used | Status | Gen speed |
|---|---|---|---|---|
| 2048 | 36 MiB | ~1.7 GiB | ✅ | ~17 tok/s |
| 4096 | 72 MiB | ~1.8 GiB | ✅ | — |
| 8192 | 144 MiB | ~1.9 GiB | ✅ | — |
| 16384 | 288 MiB | ~2.0 GiB | ✅ | — |
| 32768 | 576 MiB | ~2.2 GiB | ✅ | — |
| 65536 | 1152 MiB (est.) | ~2.3 GiB | ✅ | ~17 tok/s |
| 131072 | 798 MiB | ~2.5 GiB | ✅ | 16.3 tok/s |
Conclusion: The maximum context of 131072 tokens loads without issues with all 36/36 layers on GPU, leaving ~5 GiB of free VRAM.
docker run -d --restart unless-stopped --runtime=nvidia \
--network host --name llama-server \
-v ~/gguf:/models \
-e GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 \
ghcr.io/nvidia-ai-iot/llama_cpp:latest-jetson-orin \
llama-server \
-m /models/gemma-4-E2B-it-Q3_K_M.gguf \
-ngl 99 \
--ctx-size 131072 \
--flash-attn on \
-fit off \
--host 0.0.0.0 \
--port 11434[Unit]
Description=llama.cpp server for Gemma 4 E2B
After=docker.service
Requires=docker.service
StartLimitIntervalSec=0
[Service]
Type=exec
User=$USER
ExecStartPre=-/usr/bin/docker kill llama-server
ExecStartPre=-/usr/bin/docker rm llama-server
ExecStart=/usr/bin/docker run --rm --runtime=nvidia --network host \
--name llama-server \
-v /home/$USER/gguf:/models \
-e GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 \
ghcr.io/nvidia-ai-iot/llama_cpp:latest-jetson-orin \
llama-server \
-m /models/gemma-4-E2B-it-Q3_K_M.gguf \
-ngl 99 \
--ctx-size 131072 \
--flash-attn on \
-fit off \
--host 0.0.0.0 \
--port 11434
ExecStop=/usr/bin/docker stop llama-server
Restart=always
RestartSec=5
[Install]
WantedBy=multi-user.target| Argument / Flag | Explanation |
|---|---|
docker run |
Create and start a new container |
--rm |
Automatically remove the container when it exits |
--runtime=nvidia |
Use the NVIDIA container runtime for GPU access |
--network host |
Use the host's network stack (container shares host IP/ports) |
--name llama-server |
Assign the container the name llama-server |
-v /home/$USER/gguf:/models |
Mount host directory ~/gguf to /models inside container |
-e GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 |
Set env var — enables CUDA unified memory (allows GPU to use system RAM as overflow) |
ghcr.io/nvidia-ai-iot/llama_cpp:latest-jetson-orin |
The container image — built for Jetson Orin with llama.cpp |
llama-server |
The entrypoint binary inside the container |
-m /models/gemma-4-E2B-it-Q3_K_M.gguf |
Path to the GGUF model file inside the container |
-ngl 99 |
Offload 99 layers to the GPU (-ngl = num GPU layers) |
--ctx-size 131072 |
Context window size of 131,072 tokens |
--flash-attn on |
Enable Flash Attention for faster inference with less memory |
-fit off |
Disable "fit to context" (don't trim prompt to fit context window) |
--host 0.0.0.0 |
Bind the server to all network interfaces |
--port 11434 |
Listen on port 11434 (ollama-compatible port) |
sudo systemctl enable llama-server.service # Auto-start on boot
sudo systemctl start llama-server.service # Start now
sudo systemctl stop llama-server.service # Stop
sudo systemctl status llama-server.service # Check status
sudo journalctl -u llama-server.service -f # Follow real-time logsTest with 100 generated tokens:
echo '{"prompt": "The theory of relativity", "n_predict": 100}' > /tmp/payload.json
curl -s -X POST http://localhost:11434/completion \
-H "Content-Type: application/json" \
-d @/tmp/payload.json | python3 -m json.toolResults:
| Metric | Ollama | llama.cpp (direct) | Improvement |
|---|---|---|---|
| Generation (tok/s) | 4.54 | 16.3 | 3.6x |
| Prompt eval (tok/s) | 511 | 34.6 (5 tok) | — |
| Max context | 8192 | 131072 | 16x |
| First token latency | ~1.7s | ~0.15s | 11x |
| Model load time | ~4.3s | ~6s | — |
| API endpoint | /api/chat | /completion | OpenAI compatible |
| Flag | Purpose |
|---|---|
-ngl 99 |
Offload all layers to GPU (36/36) |
--ctx-size 131072 |
Max model context |
--flash-attn on |
Enable Flash Attention (reduces KV cache memory) |
-fit off |
Disable conservative memory auto-fitting |
GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 |
Use unified memory on Jetson |
--no-warmup |
Skip initial warmup (faster startup) |
# Force MMQ (matrix multiply quantized) if output is garbled
-e GGML_CUDA_FORCE_MMQ=1
# Disable VMM if virtual memory errors occur
-e GGML_CUDA_NO_VMM=1| Approach | Speed | Effort | GGUF Support |
|---|---|---|---|
| llama.cpp direct (chosen) | 16.3 tok/s | Low | ✅ Native |
| Ollama | 4.54 tok/s | Minimal | ✅ (with overhead) |
| vLLM container (experimental) | ~10-15 tok/s (est.) | Medium | ❌ Experimental |
| TensorRT-LLM | Best possible | Very High | ❌ Requires conversion |
Using OpenAI Api compatible llama.cpp server
- OpenAI API URL:
http://localhost:11434/v1 - Model:
gemma-4-E2B-it-Q3_K_M.gguf
- The Q3_K_M GGUF file (2.35 GiB) is the sweet spot between speed and quality for 8GB of unified RAM.
- The Q4_K_M alternative (2.9 GiB) is slightly slower (~4 tok/s less) but offers better perplexity.
- No HuggingFace authentication required (Apache 2.0 license).
- Monitor with
tegrastatsorjtopto verify resource usage.
Document generated on May 8, 2026 after calibration on real hardware.