@ricog
Last active April 13, 2026 16:22
# AMD Ryzen AI MAX+ 395 — LLM Benchmark Results (Ollama & Lemonade)
**Hardware:** Ryzen AI MAX+ 395 / Radeon 8060S (gfx1151) / 96 GiB VRAM (LPDDR5X-8000, 256 GB/s)  
**Host:** Proxmox 9.x  
**Last updated:** 2026-04-13

## Recommended Model Assignments

| Purpose | Model | Server | Notes |
| --- | --- | --- | --- |
| Embeddings | qwen3-embedding:8b (4096 dims) | Lemonade | MTEB 70.6, code-aware, 40K context |
| Extraction LLM | Qwen3.5 35B (MoE) | Lemonade | 55.4 tok/s, fast + accurate structured output |
| Code generation | Qwen3-Coder 30B (MoE) | Lemonade | 77.3 tok/s, 256K context |
| Deep code review | Qwen2.5-Coder 32B | Lemonade | 11.1 tok/s, dense reasoning |
| Business/strategy | Qwen3.5 35B (MoE) | Lemonade | 55.4 tok/s, thinking mode |
| Deep reasoning | Qwen3.5 122B (MoE) | Lemonade | 21.7 tok/s, strongest model on hardware |
| Quick tasks | Gemma 4 E2B | Lemonade | 102.9 tok/s |
| Memory retrieval | nomic-embed-text | Ollama (104) | Existing mem0 setup |

## Lemonade Server (v10.2.0) — LXC 114 (Ubuntu 24.04, ROCm gfx1151-specific binary)

| Model | Type | Size | Speed | Notes |
| --- | --- | --- | --- | --- |
| Gemma 4 E2B | Dense 2B | 3.3 GB | 102.9 tok/s | Fastest model, multimodal |
| Qwen3-Coder 30B | MoE 3B active | 20 GB | 77.3 tok/s | Best fast coder |
| Gemma 4 E4B | Dense 4B | 5.4 GB | 55.5 tok/s | Quick tasks + vision |
| Qwen3.5 35B | MoE 3B active | 21 GB | 55.4 tok/s | Best general MoE |
| Gemma 4 26B MoE | MoE 4B active | 18 GB | 49.5 tok/s | Multimodal MoE |
| DeepSeek-R1-0528-Qwen3-8B | Dense 8B | 5.3 GB | 41.7 tok/s | Fast reasoning (R1 distill) |
| Qwen3 8B | Dense 8B | 5.6 GB | 40.3 tok/s | Quick reasoning |
| Qwen3.5 122B | MoE 10B active | 73 GB | 21.7 tok/s | Strongest reasoning |
| Llama 4 Scout | MoE 109B | 66 GB | 19.3 tok/s | Multimodal, large context |
| Devstral Small 2 | Dense 24B | 15 GB | 15.1 tok/s | Agentic SWE |
| Qwen2.5-Coder 32B | Dense 32B | 21 GB | 11.1 tok/s | Deep code review |
| Gemma 4 31B | Dense 31B | 20 GB | 10.9 tok/s | General (crashes on Ollama) |

## Embedding Models

| Model | Dims | MTEB | Size | Latency (warm) |
| --- | --- | --- | --- | --- |
| qwen3-embedding 0.6b (Q8_0) | 1024 | ~60 | 610 MB | ~75 ms |
| qwen3-embedding 8b (Q4_K_M) | 4096 | 70.6 | 4.7 GB | ~300 ms |
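Note that vectors from the two models have different dimensionalities (1024 vs 4096) and are not interchangeable; retrieval compares vectors produced by the same model, typically by cosine similarity. A minimal sketch in plain Python (no client library assumed):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors of equal length."""
    if len(a) != len(b):
        # e.g. a 1024-dim vector vs a 4096-dim vector: different models
        raise ValueError("vectors from different embedding models are not comparable")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

In practice a vector store does this at scale, but the score it ranks by is the same quantity.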

## Ollama (v0.x) — LXC 104 (Ubuntu 24.04, ROCm via system install)

| Model | Type | Size | Speed | Notes |
| --- | --- | --- | --- | --- |
| Gemma 4 E2B | Dense 2B | 7.2 GB | 80.7 tok/s | |
| Qwen3.5 35B | MoE 3B active | 24 GB | 41.7 tok/s | |
| Qwen3 8B | Dense 8B | 5.2 GB | 38.5 tok/s | |
| Qwen3.5 9B | Dense 9B | 6.6 GB | 31.6 tok/s | |
| phi4-reasoning:plus | Dense 14B | 11 GB | 19.0 tok/s | |
| Qwen2.5-Coder 32B | Dense 32B | 20 GB | 11.0 tok/s | |
| Qwen3 32B | Dense 32B | 20 GB | 10.2 tok/s | |
| Gemma 4 E2B (Vulkan) | Dense 2B | 7.2 GB | 45.1 tok/s | Vulkan backend on LXC 105 |

## Ollama — Crashes (INT_MAX tensor bug in llama.cpp ROCm)

| Model | Size | Error |
| --- | --- | --- |
| Gemma 4 31B (Q4_K_M) | 20 GB | `GGML_ASSERT(ggml_nbytes(src0) <= INT_MAX)` |
| Gemma 4 31B (Q8_0) | 34 GB | Same |
| qwen3-coder:30b | 19 GB | Same |
| devstral-small-2:24b | 15 GB | Same |
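The failing assertion is a 32-bit size check: it fires whenever a single tensor's byte size exceeds `INT_MAX` (2^31 − 1, just under 2 GiB). A sketch of the arithmetic; the example tensor size here is illustrative and not taken from any of these models:

```python
INT_MAX = 2**31 - 1  # the limit in the failing GGML_ASSERT, just under 2 GiB

def tensor_nbytes(n_elements: int, bytes_per_element: float) -> int:
    """Byte size of a single tensor (rough analogue of ggml_nbytes)."""
    return int(n_elements * bytes_per_element)

# Illustrative fp16 tensor: 1.2B elements x 2 bytes = 2.4 GB > INT_MAX,
# which is the condition the ROCm backend's assertion rejects.
oversized = tensor_nbytes(1_200_000_000, 2) > INT_MAX
```

This is why the crash correlates with model architecture rather than total model size: a 15 GB model with one sufficiently large tensor crashes, while a 34 GB model split into smaller tensors could load.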

## Lemonade vs Ollama — Same Model Comparison

| Model | Lemonade | Ollama | Delta |
| --- | --- | --- | --- |
| Gemma 4 E2B | 102.9 tok/s | 80.7 tok/s | +27% Lemonade |
| Qwen3 8B | 37.6 tok/s | 38.5 tok/s | ~same |
| Qwen2.5-Coder 32B | 11.1 tok/s | 11.0 tok/s | ~same |
| Qwen3.5 35B | 55.4 tok/s | 41.7 tok/s | +33% Lemonade |

Lemonade's gfx1151-specific ROCm binary is significantly faster for some models and also avoids the INT_MAX crash that affects Ollama.
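The Delta column is a relative speedup of Lemonade over Ollama and can be reproduced (to within rounding) as:

```python
def pct_delta(lemonade_tps: float, ollama_tps: float) -> float:
    """Relative throughput difference of Lemonade over Ollama, in percent."""
    return (lemonade_tps / ollama_tps - 1.0) * 100.0
```

For example, `pct_delta(102.9, 80.7)` is about 27.5% and `pct_delta(55.4, 41.7)` about 32.9%, matching the table's rounded +27% and +33%.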

## Test Methodology

- Prompt: `"Explain quantum entanglement in two sentences. /no_think"`
- Non-streaming, single request, exclusive GPU access (other instances stopped)
- Speed = `eval_count / (eval_duration / 1e9)` from the Ollama-compatible API
- Load times excluded (models pre-warmed where load shows 0.0s)
- Embedding latency measured via curl wall-clock time, 5 runs, first run excluded (cold load)
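The speed formula and the warm-latency protocol above can be sketched as two small helpers. The field names follow Ollama's generate-API response (`eval_count`, `eval_duration` in nanoseconds); the timing harness is generic and takes any callable:

```python
import time

def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """tok/s from Ollama response fields; eval_duration is in nanoseconds."""
    return eval_count / (eval_duration_ns / 1e9)

def warm_mean_latency(call, runs: int = 5) -> float:
    """Mean wall-clock latency over `runs` calls, excluding the first (cold-load) run."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        call()
        samples.append(time.perf_counter() - start)
    return sum(samples[1:]) / (runs - 1)
```

For instance, a response with `eval_count=200` and `eval_duration=2_000_000_000` (2 s) works out to 100.0 tok/s.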