Skip to content

Instantly share code, notes, and snippets.

Show Gist options
  • Select an option

  • Save andrescevp/24677e6ba8c87bb7a90b487ebc02453d to your computer and use it in GitHub Desktop.

Select an option

Save andrescevp/24677e6ba8c87bb7a90b487ebc02453d to your computer and use it in GitHub Desktop.
Optimal Gemma 4 E2B Configuration on Jetson Orin Nano

Optimal Gemma 4 E2B Configuration on Jetson Orin Nano

Documentation of the calibration process and llama.cpp configuration with Gemma 4 E2B (Q3_K_M) on an NVIDIA Jetson Orin Nano (8GB).


Hardware

Component Specification
Device NVIDIA Jetson Orin Nano 8GB
CPU ARM Cortex-A78AE, 6 cores
GPU NVIDIA Ampere, 1024 CUDA cores, compute capability 8.7
RAM 7.4 GiB (unified, shared CPU/GPU)
JetPack 6.2 (L4T r36.4, CUDA 12.6)
Storage NVMe SSD

Model

I have manually downloaded the models optimized by unsloth

Property Value
Model google/gemma-4-E2B-it (Effective 2B)
Total params 4.65B
File gemma-4-E2B-it-Q3_K_M.gguf (2.35 GiB)
Quantization Q3_K_M (4.34 BPW)
Layers 35
Max context 131072 tokens
License Apache 2.0

Calibration Process

Assuming you had installed Ollama like me at first.

1. Free Resources

sudo systemctl stop ollama
echo 3 | sudo tee /proc/sys/vm/drop_caches

2. Set CPU Governor to Max Performance

for cpu in /sys/devices/system/cpu/cpu[0-5]/cpufreq/scaling_governor; do
  echo performance | sudo tee $cpu
done

3. Pull NVIDIA's Jetson-Optimized Container

Yeah, you need docker at this point

docker pull ghcr.io/nvidia-ai-iot/llama_cpp:latest-jetson-orin

4. Context Size Testing (Binary Search)

Context sizes tested with -fit off to force allocation:

ctx-size KV cache (GPU) GPU VRAM used Status Gen speed
2048 36 MiB ~1.7 GiB ~17 tok/s
4096 72 MiB ~1.8 GiB
8192 144 MiB ~1.9 GiB
16384 288 MiB ~2.0 GiB
32768 576 MiB ~2.2 GiB
65536 1152 MiB (est.) ~2.3 GiB ~17 tok/s
131072 798 MiB ~2.5 GiB 16.3 tok/s

Conclusion: The maximum context of 131072 tokens loads without issues with all 36/36 layers on GPU, leaving ~5 GiB of free VRAM.

Final Configuration

Container (direct docker run)

docker run -d --restart unless-stopped --runtime=nvidia \
  --network host --name llama-server \
  -v ~/gguf:/models \
  -e GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 \
  ghcr.io/nvidia-ai-iot/llama_cpp:latest-jetson-orin \
  llama-server \
    -m /models/gemma-4-E2B-it-Q3_K_M.gguf \
    -ngl 99 \
    --ctx-size 131072 \
    --flash-attn on \
    -fit off \
    --host 0.0.0.0 \
    --port 11434

Systemd Service (/etc/systemd/system/llama-server.service)

[Unit]
Description=llama.cpp server for Gemma 4 E2B
After=docker.service
Requires=docker.service
StartLimitIntervalSec=0

[Service]
Type=exec
User=$USER
ExecStartPre=-/usr/bin/docker kill llama-server
ExecStartPre=-/usr/bin/docker rm llama-server
ExecStart=/usr/bin/docker run --rm --runtime=nvidia --network host \
  --name llama-server \
  -v /home/$USER/gguf:/models \
  -e GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 \
  ghcr.io/nvidia-ai-iot/llama_cpp:latest-jetson-orin \
  llama-server \
    -m /models/gemma-4-E2B-it-Q3_K_M.gguf \
    -ngl 99 \
    --ctx-size 131072 \
    --flash-attn on \
    -fit off \
    --host 0.0.0.0 \
    --port 11434
ExecStop=/usr/bin/docker stop llama-server
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target

What we run

Argument / Flag Explanation
docker run Create and start a new container
--rm Automatically remove the container when it exits
--runtime=nvidia Use the NVIDIA container runtime for GPU access
--network host Use the host's network stack (container shares host IP/ports)
--name llama-server Assign the container the name llama-server
-v /home/$USER/gguf:/models Mount host directory ~/gguf to /models inside container
-e GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 Set env var — enables CUDA unified memory (allows GPU to use system RAM as overflow)
ghcr.io/nvidia-ai-iot/llama_cpp:latest-jetson-orin The container image — built for Jetson Orin with llama.cpp
llama-server The entrypoint binary inside the container
-m /models/gemma-4-E2B-it-Q3_K_M.gguf Path to the GGUF model file inside the container
-ngl 99 Offload 99 layers to the GPU (-ngl = num GPU layers)
--ctx-size 131072 Context window size of 131,072 tokens
--flash-attn on Enable Flash Attention for faster inference with less memory
-fit off Disable "fit to context" (don't trim prompt to fit context window)
--host 0.0.0.0 Bind the server to all network interfaces
--port 11434 Listen on port 11434 (ollama-compatible port)

Service Management

sudo systemctl enable llama-server.service   # Auto-start on boot
sudo systemctl start llama-server.service    # Start now
sudo systemctl stop llama-server.service     # Stop
sudo systemctl status llama-server.service   # Check status
sudo journalctl -u llama-server.service -f   # Follow real-time logs

Benchmark

Test with 100 generated tokens:

echo '{"prompt": "The theory of relativity", "n_predict": 100}' > /tmp/payload.json
curl -s -X POST http://localhost:11434/completion \
  -H "Content-Type: application/json" \
  -d @/tmp/payload.json | python3 -m json.tool

Results:

Metric Ollama llama.cpp (direct) Improvement
Generation (tok/s) 4.54 16.3 3.6x
Prompt eval (tok/s) 511 34.6 (5 tok)
Max context 8192 131072 16x
First token latency ~1.7s ~0.15s 11x
Model load time ~4.3s ~6s
API endpoint /api/chat /completion OpenAI compatible

Key Flags Explained

Flag Purpose
-ngl 99 Offload all layers to GPU (36/36)
--ctx-size 131072 Max model context
--flash-attn on Enable Flash Attention (reduces KV cache memory)
-fit off Disable conservative memory auto-fitting
GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 Use unified memory on Jetson
--no-warmup Skip initial warmup (faster startup)

Troubleshooting Environment Variables

# Force MMQ (matrix multiply quantized) if output is garbled
-e GGML_CUDA_FORCE_MMQ=1

# Disable VMM if virtual memory errors occur
-e GGML_CUDA_NO_VMM=1

Alternative Approaches Comparison

Approach Speed Effort GGUF Support
llama.cpp direct (chosen) 16.3 tok/s Low ✅ Native
Ollama 4.54 tok/s Minimal ✅ (with overhead)
vLLM container (experimental) ~10-15 tok/s (est.) Medium ❌ Experimental
TensorRT-LLM Best possible Very High ❌ Requires conversion

Integration with clients

Using OpenAI Api compatible llama.cpp server

  • OpenAI API URL: http://localhost:11434/v1
  • Model: gemma-4-E2B-it-Q3_K_M.gguf

Additional Notes

  • The Q3_K_M GGUF file (2.35 GiB) is the sweet spot between speed and quality for 8GB of unified RAM.
  • The Q4_K_M alternative (2.9 GiB) is slightly slower (~4 tok/s less) but offers better perplexity.
  • No HuggingFace authentication required (Apache 2.0 license).
  • Monitor with tegrastats or jtop to verify resource usage.

Document generated on May 8, 2026 after calibration on real hardware.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment