Optimal Gemma 4 E2B Configuration on Jetson Orin Nano

Documentation of the calibration process and llama.cpp configuration with Gemma 4 E2B (Q3_K_M) on an NVIDIA Jetson Orin Nano (8GB).

Hardware

Component	Specification
Device	NVIDIA Jetson Orin Nano 8GB
CPU	ARM Cortex-A78AE, 6 cores
GPU	NVIDIA Ampere, 1024 CUDA cores, compute capability 8.7
RAM	7.4 GiB (unified, shared CPU/GPU)
JetPack	6.2 (L4T r36.4, CUDA 12.6)
Storage	NVMe SSD

Model

I have manually downloaded the models optimized by unsloth

Property	Value
Model	google/gemma-4-E2B-it (Effective 2B)
Total params	4.65B
File	gemma-4-E2B-it-Q3_K_M.gguf (2.35 GiB)
Quantization	Q3_K_M (4.34 BPW)
Layers	35
Max context	131072 tokens
License	Apache 2.0

Calibration Process

Assuming you had installed Ollama like me at first.

1. Free Resources

sudo systemctl stop ollama
echo 3 | sudo tee /proc/sys/vm/drop_caches

2. Set CPU Governor to Max Performance

for cpu in /sys/devices/system/cpu/cpu[0-5]/cpufreq/scaling_governor; do
  echo performance | sudo tee $cpu
done

3. Pull NVIDIA's Jetson-Optimized Container

Yeah, you need docker at this point

docker pull ghcr.io/nvidia-ai-iot/llama_cpp:latest-jetson-orin

4. Context Size Testing (Binary Search)

Context sizes tested with -fit off to force allocation:

ctx-size	KV cache (GPU)	GPU VRAM used	Status	Gen speed
2048	36 MiB	~1.7 GiB	✅	~17 tok/s
4096	72 MiB	~1.8 GiB	✅	—
8192	144 MiB	~1.9 GiB	✅	—
16384	288 MiB	~2.0 GiB	✅	—
32768	576 MiB	~2.2 GiB	✅	—
65536	1152 MiB (est.)	~2.3 GiB	✅	~17 tok/s
131072	798 MiB	~2.5 GiB	✅	16.3 tok/s

Conclusion: The maximum context of 131072 tokens loads without issues with all 36/36 layers on GPU, leaving ~5 GiB of free VRAM.

Final Configuration

Container (direct docker run)

docker run -d --restart unless-stopped --runtime=nvidia \
  --network host --name llama-server \
  -v ~/gguf:/models \
  -e GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 \
  ghcr.io/nvidia-ai-iot/llama_cpp:latest-jetson-orin \
  llama-server \
    -m /models/gemma-4-E2B-it-Q3_K_M.gguf \
    -ngl 99 \
    --ctx-size 131072 \
    --flash-attn on \
    -fit off \
    --host 0.0.0.0 \
    --port 11434

Systemd Service (`/etc/systemd/system/llama-server.service`)

[Unit]
Description=llama.cpp server for Gemma 4 E2B
After=docker.service
Requires=docker.service
StartLimitIntervalSec=0

[Service]
Type=exec
User=$USER
ExecStartPre=-/usr/bin/docker kill llama-server
ExecStartPre=-/usr/bin/docker rm llama-server
ExecStart=/usr/bin/docker run --rm --runtime=nvidia --network host \
  --name llama-server \
  -v /home/$USER/gguf:/models \
  -e GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 \
  ghcr.io/nvidia-ai-iot/llama_cpp:latest-jetson-orin \
  llama-server \
    -m /models/gemma-4-E2B-it-Q3_K_M.gguf \
    -ngl 99 \
    --ctx-size 131072 \
    --flash-attn on \
    -fit off \
    --host 0.0.0.0 \
    --port 11434
ExecStop=/usr/bin/docker stop llama-server
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target

What we run

Argument / Flag	Explanation
`docker run`	Create and start a new container
`--rm`	Automatically remove the container when it exits
`--runtime=nvidia`	Use the NVIDIA container runtime for GPU access
`--network host`	Use the host's network stack (container shares host IP/ports)
`--name llama-server`	Assign the container the name `llama-server`
`-v /home/$USER/gguf:/models`	Mount host directory `~/gguf` to `/models` inside container
`-e GGML_CUDA_ENABLE_UNIFIED_MEMORY=1`	Set env var — enables CUDA unified memory (allows GPU to use system RAM as overflow)
`ghcr.io/nvidia-ai-iot/llama_cpp:latest-jetson-orin`	The container image — built for Jetson Orin with llama.cpp
`llama-server`	The entrypoint binary inside the container
`-m /models/gemma-4-E2B-it-Q3_K_M.gguf`	Path to the GGUF model file inside the container
`-ngl 99`	Offload 99 layers to the GPU (`-ngl` = num GPU layers)
`--ctx-size 131072`	Context window size of 131,072 tokens
`--flash-attn on`	Enable Flash Attention for faster inference with less memory
`-fit off`	Disable "fit to context" (don't trim prompt to fit context window)
`--host 0.0.0.0`	Bind the server to all network interfaces
`--port 11434`	Listen on port 11434 (ollama-compatible port)

Service Management

sudo systemctl enable llama-server.service   # Auto-start on boot
sudo systemctl start llama-server.service    # Start now
sudo systemctl stop llama-server.service     # Stop
sudo systemctl status llama-server.service   # Check status
sudo journalctl -u llama-server.service -f   # Follow real-time logs

Benchmark

Test with 100 generated tokens:

echo '{"prompt": "The theory of relativity", "n_predict": 100}' > /tmp/payload.json
curl -s -X POST http://localhost:11434/completion \
  -H "Content-Type: application/json" \
  -d @/tmp/payload.json | python3 -m json.tool

Results:

Metric	Ollama	llama.cpp (direct)	Improvement
Generation (tok/s)	4.54	16.3	3.6x
Prompt eval (tok/s)	511	34.6 (5 tok)	—
Max context	8192	131072	16x
First token latency	~1.7s	~0.15s	11x
Model load time	~4.3s	~6s	—
API endpoint	/api/chat	/completion	OpenAI compatible

Key Flags Explained

Flag	Purpose
`-ngl 99`	Offload all layers to GPU (36/36)
`--ctx-size 131072`	Max model context
`--flash-attn on`	Enable Flash Attention (reduces KV cache memory)
`-fit off`	Disable conservative memory auto-fitting
`GGML_CUDA_ENABLE_UNIFIED_MEMORY=1`	Use unified memory on Jetson
`--no-warmup`	Skip initial warmup (faster startup)

Troubleshooting Environment Variables

# Force MMQ (matrix multiply quantized) if output is garbled
-e GGML_CUDA_FORCE_MMQ=1

# Disable VMM if virtual memory errors occur
-e GGML_CUDA_NO_VMM=1

Alternative Approaches Comparison

Approach	Speed	Effort	GGUF Support
llama.cpp direct (chosen)	16.3 tok/s	Low	✅ Native
Ollama	4.54 tok/s	Minimal	✅ (with overhead)
vLLM container (experimental)	~10-15 tok/s (est.)	Medium	❌ Experimental
TensorRT-LLM	Best possible	Very High	❌ Requires conversion

Integration with clients

Using OpenAI Api compatible llama.cpp server

OpenAI API URL: http://localhost:11434/v1
Model: gemma-4-E2B-it-Q3_K_M.gguf

Additional Notes

The Q3_K_M GGUF file (2.35 GiB) is the sweet spot between speed and quality for 8GB of unified RAM.
The Q4_K_M alternative (2.9 GiB) is slightly slower (~4 tok/s less) but offers better perplexity.
No HuggingFace authentication required (Apache 2.0 license).
Monitor with tegrastats or jtop to verify resource usage.

Document generated on May 8, 2026 after calibration on real hardware.

andrescevp/JETSON_NANO_GEMMA4_130k_CONTEXT.MD

Select an option

No results found

Select an option

No results found

Optimal Gemma 4 E2B Configuration on Jetson Orin Nano

Hardware

Model

Calibration Process

1. Free Resources

2. Set CPU Governor to Max Performance

3. Pull NVIDIA's Jetson-Optimized Container

4. Context Size Testing (Binary Search)

Final Configuration

Container (direct docker run)

Systemd Service (`/etc/systemd/system/llama-server.service`)

What we run

Service Management

Benchmark

Key Flags Explained

Troubleshooting Environment Variables

Alternative Approaches Comparison

Integration with clients

Additional Notes

andrescevp/JETSON_NANO_GEMMA4_130k_CONTEXT.MD

Optimal Gemma 4 E2B Configuration on Jetson Orin Nano

Hardware

Model

Calibration Process

1. Free Resources

2. Set CPU Governor to Max Performance

3. Pull NVIDIA's Jetson-Optimized Container

4. Context Size Testing (Binary Search)

Final Configuration

Container (direct docker run)

Systemd Service (/etc/systemd/system/llama-server.service)

What we run

Service Management

Benchmark

Key Flags Explained

Troubleshooting Environment Variables

Alternative Approaches Comparison

Integration with clients

Additional Notes

Systemd Service (`/etc/systemd/system/llama-server.service`)