"All things leave behind them the Obscurity... and go forward to embrace the Brightness..." — Dao De Jing #42
- Q: Who provides the best GGUFs now?
- A: They're all pretty good.
Skip down if you just want graphs and numbers comparing various Qwen3-30B-A3B GGUF quants.
It's been well over a year since TheBloke uploaded his last quant to huggingface. The LLM landscape has changed markedly since then with many new models being released monthly, new inference engines targeting specific hardware optimizations, and ongoing evolution of quantization algorithms. Our community continues to grow and diversify at an amazing rate.
Fortunately, many folks and organizations have kindly stepped up to keep the quants cooking so we can all find an LLM sized just right to fit on our home rigs. Among them, bartowski and unsloth (Daniel and Michael's start-up company) have become the new "household names" for providing a variety of GGUF quantizations for popular model releases and even all those wild creative fine-tunes! (There are many more, including team mradermacher; too many to list everyone, sorry!)
Until recently most GGUF style quants' recipes were "static", meaning that all the tensors and layers were quantized the same (e.g. Q8_0) or with consistent patterns defined in llama.cpp's code. So all quants of a given size were mostly the same regardless of who cooked and uploaded them to huggingface.
Things began to change over a year ago with major advancements like importance matrix quantizations by ikawrakow in llama.cpp PR#4861, as well as new quant types (like the perennial favorite IQ4_XS), which have become the mainstay for users of llama.cpp, ollama, koboldcpp, lmstudio, etc. The entire GGUF ecosystem owes a big thanks not just to ggerganov but also to ikawrakow (as well as the many more contributors).
Very recently unsloth introduced a few changes to their quantization methodology, combining different imatrix calibration texts and context lengths along with making some tensors/layers different sizes than the regular llama.cpp code (they had a public fork with their branch, but have to update and re-push it due to upstream changes). They have named this change in standard methodology Unsloth Dynamic 2.0 GGUFs as part of their start-up company's marketing strategy.
Around the same time bartowski has been experimenting with different imatrix calibration texts and opened a PR to llama.cpp modifying the default tensor/layer quantization recipes. I myself began experimenting with custom "dynamic" quantization recipes using ikawrakow's latest SOTA quants like iq4_k, which to date only work on his ik_llama.cpp fork.
While this is great news for all GGUF enjoyers, the friendly competition and additional options have led to some confusion and I dare say some "tribalism". (If part of your identity as a person depends on downloading quants from only one source, I suggest you google: "Nan Yar?").
So how can you, dear reader, decide which is the best quant of a given model for you to download? unsloth already did a great blog post discussing their own benchmarks and metrics. Open a tab to check out u/AaronFeng47's many other benchmarks. And finally, this post contains even more metrics and benchmarks. The best answer I have is "Nullius in verba" (Latin for "take nobody's word for it") — even my word!
Unfortunately, this means there is no one-size-fits-all rule, "X" is not always better than "Y", and if you want to min-max-optimize your LLM for your specific use case on your specific hardware you probably will have to experiment and think critically. If you don't care too much, then pick any of the biggest quants that fit on your rig at the desired context length and you'll be fine because: they're all pretty good.
And with that, let's dive into the Qwen3-30B-A3B benchmarks below!
Shout out to Wendell and the Level1Techs crew, the L1T Forums, and the L1T YouTube Channel! BIG thanks for providing BIG hardware expertise and access to run these experiments and make great quants available to the community!!!
👈 Qwen3-30B-A3B Benchmark Suite Graphs
Note: <think> mode was disabled for these tests to speed up benchmarking.
👈 Qwen3-30B-A3B Perplexity and KLD Graphs
Using the BF16 as baseline for KLD stats. Also note the perplexity was lowest ("best") for models other than the bf16, which is not typically the case unless there was possibly some QAT going on. As such, the chart is relative to the lowest perplexity score: PPL/min(PPL) - 1, plus a small eps for scaling.
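To make that scaling concrete, here is a minimal sketch in plain Python (values pulled from the raw data table below; the eps is an illustrative assumption, not necessarily the exact value used for the charts):

```python
# Chart scaling: each quant's perplexity relative to the lowest PPL observed,
# i.e. PPL/min(PPL) - 1, plus a small eps so the best model doesn't sit at exactly zero.
ppl = {
    "Qwen3-30B-A3B-BF16": 9.0703,   # from the wiki.test.raw column in the raw data table
    "ubergarm IQ4_KS": 8.9862,      # note: lower than the BF16 baseline
    "unsloth UD-Q4_K_XL": 9.1688,
}

eps = 1e-4  # assumed value, purely for plotting
best = min(ppl.values())
relative = {name: p / best - 1 + eps for name, p in ppl.items()}

for name, value in sorted(relative.items(), key=lambda kv: kv[1]):
    print(f"{name:24s} {value:.6f}")
```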
[Graphs: PPL on wiki.test.raw, PPL on ubergarm-kld-test-corpus.txt, KLD, and Δp (lower is "better" for each)]
👈 Qwen3-235B-A22B Perplexity and KLD Graphs
Not as many data points here but just for comparison. Keep in mind the Q8_0 was the baseline for KLD stats given I couldn't easily run the full BF16.
[Graphs: PPL on wiki.test.raw, PPL on ubergarm-kld-test-corpus.txt, KLD, and Δp (lower is "better" for each)]
👈 Qwen3-30B-A3B Speed llama-sweep-bench Graphs
[Speed graphs for llama.cpp and ik_llama.cpp]
NOTE: Keep in mind ik's fork is faster than mainline llama.cpp for many architectures and configurations, especially CPU-only, hybrid CPU+GPU, and DeepSeek MLA cases.
👈 Perplexity, KLD, and imatrix Methodology
PPL and KLD testing done with ik_llama.cpp@9ba36270.
I adjust ngl and threads for larger 235B models.
CUDA_VISIBLE_DEVICES="0" \
./build/bin/llama-perplexity \
-m "$model" \
--ctx-size 512 \
--ubatch-size 512 \
-f wiki.test.raw \
-fa \
-ngl 99 \
--seed 1337 \
--threads 1
For the KLD runs I again adjust ngl and threads for the larger 235B models. For 235B I had to use the Q8_0 as the baseline given this rig can't easily run the full 400+GiB BF16.
CUDA_VISIBLE_DEVICES="0" \
./build/bin/llama-perplexity \
-m "$model" \
--kl-divergence-base /mnt/raid/models/ubergarm/Qwen3-30B-A3B-GGUF/Qwen3-30B-A3B-BF16-ubergarm-kld-test-corpus-base.dat \
--kl-divergence \
-f ubergarm-kld-test-corpus.txt \
-fa \
-ngl 99 \
--seed 1337 \
--threads 1
This is how I make my imatrix using ik_llama.cpp to additionally print out cosine similarity data to inform possible custom quant strategies. I haven't seen how exactly unsloth makes their new recipe.
CUDA_VISIBLE_DEVICES="0" \
./build/bin/llama-imatrix \
--verbosity 1 \
--layer-similarity \
-m /mnt/raid/models/Qwen/Qwen3-30B-A3B/Qwen3-30B-A3B-BF16-00001-of-00002.gguf \
-f calibration_data_v5_rc.txt \
-o /mnt/raid/models/ubergarm/Qwen3-30B-A3B-GGUF/imatrix-Qwen3-30B-A3B.dat \
--ctx-size 512 \
-ngl 36 \
--threads 16
======================== sorted layer importances
0: Layer 0, <cos_sim> = 0.32154
1: Layer 47, <cos_sim> = 0.38473
2: Layer 1, <cos_sim> = 0.736987
3: Layer 28, <cos_sim> = 0.845492
4: Layer 2, <cos_sim> = 0.847391
5: Layer 29, <cos_sim> = 0.859291
6: Layer 7, <cos_sim> = 0.861405
7: Layer 3, <cos_sim> = 0.878313
8: Layer 8, <cos_sim> = 0.893971
9: Layer 6, <cos_sim> = 0.900308
10: Layer 42, <cos_sim> = 0.911525
11: Layer 5, <cos_sim> = 0.912156
12: Layer 17, <cos_sim> = 0.913169
13: Layer 4, <cos_sim> = 0.914095
14: Layer 13, <cos_sim> = 0.92175
15: Layer 46, <cos_sim> = 0.925283
16: Layer 19, <cos_sim> = 0.926845
17: Layer 18, <cos_sim> = 0.927019
18: Layer 45, <cos_sim> = 0.928896
19: Layer 40, <cos_sim> = 0.934481
20: Layer 31, <cos_sim> = 0.934585
21: Layer 14, <cos_sim> = 0.936932
22: Layer 16, <cos_sim> = 0.940338
23: Layer 25, <cos_sim> = 0.940477
24: Layer 10, <cos_sim> = 0.942312
25: Layer 38, <cos_sim> = 0.943166
26: Layer 9, <cos_sim> = 0.943843
27: Layer 11, <cos_sim> = 0.944233
28: Layer 37, <cos_sim> = 0.944325
29: Layer 20, <cos_sim> = 0.94612
30: Layer 22, <cos_sim> = 0.946449
31: Layer 41, <cos_sim> = 0.946775
32: Layer 39, <cos_sim> = 0.947228
33: Layer 44, <cos_sim> = 0.947687
34: Layer 30, <cos_sim> = 0.947942
35: Layer 23, <cos_sim> = 0.949102
36: Layer 12, <cos_sim> = 0.951618
37: Layer 21, <cos_sim> = 0.951701
38: Layer 24, <cos_sim> = 0.952261
39: Layer 43, <cos_sim> = 0.953357
40: Layer 27, <cos_sim> = 0.953528
41: Layer 26, <cos_sim> = 0.95575
42: Layer 32, <cos_sim> = 0.956024
43: Layer 15, <cos_sim> = 0.956915
44: Layer 35, <cos_sim> = 0.959861
45: Layer 36, <cos_sim> = 0.960591
46: Layer 34, <cos_sim> = 0.961539
47: Layer 33, <cos_sim> = 0.968161
======================== sorted attention importances
0: Layer 0, <cos_sim> = 0.353019
1: Layer 45, <cos_sim> = 0.638476
2: Layer 1, <cos_sim> = 0.674894
3: Layer 29, <cos_sim> = 0.686547
4: Layer 17, <cos_sim> = 0.708034
5: Layer 3, <cos_sim> = 0.718456
6: Layer 21, <cos_sim> = 0.72082
7: Layer 44, <cos_sim> = 0.732611
8: Layer 22, <cos_sim> = 0.738435
9: Layer 18, <cos_sim> = 0.742531
10: Layer 42, <cos_sim> = 0.745018
11: Layer 8, <cos_sim> = 0.746792
12: Layer 24, <cos_sim> = 0.750162
13: Layer 23, <cos_sim> = 0.750384
14: Layer 9, <cos_sim> = 0.754324
15: Layer 46, <cos_sim> = 0.758528
16: Layer 33, <cos_sim> = 0.76019
17: Layer 47, <cos_sim> = 0.760449
18: Layer 27, <cos_sim> = 0.760966
19: Layer 4, <cos_sim> = 0.761774
20: Layer 2, <cos_sim> = 0.762337
21: Layer 6, <cos_sim> = 0.763453
22: Layer 34, <cos_sim> = 0.765167
23: Layer 30, <cos_sim> = 0.768629
24: Layer 25, <cos_sim> = 0.768819
25: Layer 26, <cos_sim> = 0.769841
26: Layer 20, <cos_sim> = 0.77039
27: Layer 10, <cos_sim> = 0.772251
28: Layer 41, <cos_sim> = 0.773975
29: Layer 35, <cos_sim> = 0.774599
30: Layer 43, <cos_sim> = 0.775401
31: Layer 11, <cos_sim> = 0.776914
32: Layer 28, <cos_sim> = 0.778543
33: Layer 19, <cos_sim> = 0.781975
34: Layer 36, <cos_sim> = 0.78645
35: Layer 32, <cos_sim> = 0.790626
36: Layer 15, <cos_sim> = 0.795375
37: Layer 12, <cos_sim> = 0.797279
38: Layer 16, <cos_sim> = 0.797483
39: Layer 14, <cos_sim> = 0.797921
40: Layer 7, <cos_sim> = 0.80098
41: Layer 5, <cos_sim> = 0.802361
42: Layer 37, <cos_sim> = 0.805299
43: Layer 13, <cos_sim> = 0.806054
44: Layer 31, <cos_sim> = 0.807454
45: Layer 38, <cos_sim> = 0.808983
46: Layer 40, <cos_sim> = 0.813216
47: Layer 39, <cos_sim> = 0.816557
======================== sorted ffn importances
0: Layer 47, <cos_sim> = 0.613059
1: Layer 44, <cos_sim> = 0.630819
2: Layer 0, <cos_sim> = 0.653987
3: Layer 28, <cos_sim> = 0.686159
4: Layer 16, <cos_sim> = 0.693473
5: Layer 7, <cos_sim> = 0.694612
6: Layer 43, <cos_sim> = 0.710648
7: Layer 20, <cos_sim> = 0.71511
8: Layer 21, <cos_sim> = 0.715567
9: Layer 46, <cos_sim> = 0.71785
10: Layer 45, <cos_sim> = 0.718143
11: Layer 1, <cos_sim> = 0.726385
12: Layer 3, <cos_sim> = 0.735632
13: Layer 8, <cos_sim> = 0.736597
14: Layer 2, <cos_sim> = 0.737616
15: Layer 22, <cos_sim> = 0.739272
16: Layer 33, <cos_sim> = 0.739951
17: Layer 19, <cos_sim> = 0.740003
18: Layer 9, <cos_sim> = 0.742748
19: Layer 32, <cos_sim> = 0.747542
20: Layer 23, <cos_sim> = 0.749229
21: Layer 24, <cos_sim> = 0.755807
22: Layer 41, <cos_sim> = 0.75653
23: Layer 10, <cos_sim> = 0.757337
24: Layer 34, <cos_sim> = 0.758472
25: Layer 31, <cos_sim> = 0.759585
26: Layer 40, <cos_sim> = 0.763913
27: Layer 17, <cos_sim> = 0.768032
28: Layer 26, <cos_sim> = 0.768999
29: Layer 18, <cos_sim> = 0.771782
30: Layer 6, <cos_sim> = 0.776553
31: Layer 4, <cos_sim> = 0.777394
32: Layer 27, <cos_sim> = 0.777827
33: Layer 35, <cos_sim> = 0.778635
34: Layer 42, <cos_sim> = 0.779552
35: Layer 36, <cos_sim> = 0.779963
36: Layer 25, <cos_sim> = 0.785371
37: Layer 12, <cos_sim> = 0.785794
38: Layer 29, <cos_sim> = 0.787757
39: Layer 5, <cos_sim> = 0.79259
40: Layer 11, <cos_sim> = 0.793774
41: Layer 15, <cos_sim> = 0.796992
42: Layer 30, <cos_sim> = 0.797935
43: Layer 14, <cos_sim> = 0.7999
44: Layer 39, <cos_sim> = 0.806665
45: Layer 38, <cos_sim> = 0.813561
46: Layer 13, <cos_sim> = 0.820982
47: Layer 37, <cos_sim> = 0.830343
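To illustrate how this output might feed into a custom recipe, here is a minimal throwaway sketch (plain Python, my own naming, not part of ik_llama.cpp) that parses the sorted importance lines and flags the lowest-cos_sim layers as candidates for larger quant types:

```python
import re

# Lines look like: "  3: Layer 28, <cos_sim> = 0.845492"
LINE_RE = re.compile(r"^\s*\d+:\s*Layer\s+(\d+),\s*<cos_sim>\s*=\s*([\d.]+)")

def parse_importances(text: str) -> list[tuple[int, float]]:
    """Return (layer, cos_sim) pairs from one 'sorted ... importances' block."""
    return [(int(m.group(1)), float(m.group(2)))
            for m in map(LINE_RE.match, text.splitlines()) if m]

def least_similar(pairs: list[tuple[int, float]], n: int = 8) -> list[int]:
    """Lower cos_sim means the layer changes its activations more, so (arguably)
    it is a candidate for a higher-bit quant in a custom mix."""
    return [layer for layer, _ in sorted(pairs, key=lambda p: p[1])[:n]]

if __name__ == "__main__":
    block = """
     0: Layer 0, <cos_sim> = 0.32154
     1: Layer 47, <cos_sim> = 0.38473
     2: Layer 1, <cos_sim> = 0.736987
    """
    pairs = parse_importances(block)
    print("candidate layers for bigger tensors:", least_similar(pairs, n=2))
```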
👈 Benchmarking Methodology
The benchmark client used is bartowski's patched evalchemy fork containing fixes for easier use across a variety of LLM server API endpoints.
Benchmark suite testing done with llama.cpp@36667c8e on a subset of models.
For the llama.cpp server:
cd llama.cpp
git checkout 36667c8e
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j $(nproc)
model=/mnt/raid/models/bartowski/Qwen_Qwen3-30B-A3B-GGUF/Qwen_Qwen3-30B-A3B-IQ2_M.gguf
name=bartowski/Qwen3-30B-A3B-IQ2_M
CUDA_VISIBLE_DEVICES="1" \
./build/bin/llama-server \
--model "$model" \
--alias "$name" \
--api-key super-secret-change-me \
-fa \
-ctk f16 -ctv f16 \
-c 262144 \
--parallel 8 \
--slots \
-ngl 99 \
--threads 1 \
--host 127.0.0.1 \
--port 8088
For the ik_llama.cpp server:
cd ik_llama.cpp
git checkout e3fec173
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j $(nproc)
model=/mnt/raid/models/ubergarm/Qwen3-30B-A3B-GGUF/Qwen3-30B-A3B-mix-IQ4_K.gguf
name=ubergarm/Qwen3-30B-A3B-mix-IQ4_K
CUDA_VISIBLE_DEVICES="1" \
./build/bin/llama-server \
--model "$model" \
--alias "$name" \
--api-key super-secret-change-me \
-fmoe \
-fa \
-ctk f16 -ctv f16 \
-c 262144 \
--parallel 8 \
-ngl 99 \
--threads 1 \
--host 127.0.0.1 \
--port 8088
For the vllm server:
CUDA_VISIBLE_DEVICES="1" \
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
VLLM_USE_MODELSCOPE=True \
vllm \
serve swift/Qwen3-30B-A3B-AWQ \
--served-model-name Qwen3-30B-A3B-AWQ \
--gpu-memory-utilization 0.9 \
--max-model-len 32768 \
--max-num-seqs 64 \
--api-key super-secret-change-me \
--host 127.0.0.1 \
--port 8080
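All three servers expose an OpenAI-compatible chat completions API, which is what the benchmark client talks to. A minimal connectivity check against the llama-server instance above might look like the following (plain Python with requests; just a sanity check, not the evalchemy client itself):

```python
import requests

# Matches the --host/--port/--api-key/--alias values used when launching llama-server above.
URL = "http://127.0.0.1:8088/v1/chat/completions"
HEADERS = {"Authorization": "Bearer super-secret-change-me"}

payload = {
    "model": "bartowski/Qwen3-30B-A3B-IQ2_M",
    "messages": [{"role": "user", "content": "Say hello in five words."}],
    "max_tokens": 32,
    "temperature": 0.0,
}

resp = requests.post(URL, headers=HEADERS, json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```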
👈 Speed Benchmark Methodology
Note: probably no warmup; I saw a PR on ik's fork about it, so the first data point trends low.
cd llama.cpp
git checkout ug/port-sweep-bench
# llama.cpp@814f795e + ug/port-sweep-bench
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j $(nproc)
#model=/mnt/astrodata/llm/models/bartowski/Qwen_Qwen3-30B-A3B-GGUF/Qwen_Qwen3-30B-A3B-Q4_K_M.gguf
#model=/mnt/astrodata/llm/models/bartowski/Qwen_Qwen3-30B-A3B-GGUF/Qwen_Qwen3-30B-A3B-Q2_K_L.gguf
#model=/mnt/astrodata/llm/models/bartowski/Qwen_Qwen3-30B-A3B-GGUF/Qwen_Qwen3-30B-A3B-IQ2_M.gguf
#model=/mnt/astrodata/llm/models/unsloth/Qwen3-30B-A3B-GGUF/Qwen3-30B-A3B-UD-Q2_K_XL.gguf
#model=/mnt/astrodata/llm/models/unsloth/Qwen3-30B-A3B-GGUF/Qwen3-30B-A3B-UD-IQ2_M.gguf
model=/mnt/astrodata/llm/models/unsloth/Qwen3-30B-A3B-GGUF/Qwen3-30B-A3B-UD-Q4_K_XL.gguf
CUDA_VISIBLE_DEVICES=0 \
./build/bin/llama-sweep-bench \
--model "$model" \
-fa \
-ctk f16 -ctv f16 \
-c 32768 \
-ngl 99 \
--threads 1
👈 Perplexity, KLD, and Δp Raw Data Table
Parsed this data from a bunch of logs generated above. It is not in the most beautiful order, so feel free to copy/paste it into Google Docs or whatever you'd like to use to make your own graphs.
Model | Size (GiB) | 0.1% Δp | 1.0% KLD | 1.0% Δp | 10.0% KLD | 10.0% Δp | 25.0% Δp | 5.0% KLD | 5.0% Δp | 75.0% Δp | 90.0% Δp | 95.0% Δp | 99.0% KLD | 99.0% Δp | 99.9% KLD | 99.9% Δp | Maximum KLD | Maximum Δp | Mean KLD | Mean KLD uncertainty | Mean Δp | Mean Δp uncertainty | Mean PPL(Q) ubergarm-kld-test-corpus.txt | Mean PPL(Q) uncertainty ubergarm-kld-test-corpus.txt | Median KLD | Median Δp | Minimum KLD | Minimum Δp | PPL uncertainty wiki.test.raw | PPL wiki.test.raw | RMS Δp | RMS Δp uncertainty | Same top p | Same top p uncertainty |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Qwen/Qwen3-235B-A22B-BF16 | 438 | |||||||||||||||||||||||||||||||||
ubergarm/Qwen3-235B-A22B-Q8_0 | 233 | 11.7194 | 0.07212 | 0.03321 | 5.3141 | |||||||||||||||||||||||||||||
ubergarm/Qwen3-235B-A22B-mix-IQ3_K | 107 | -18.276% | 0.000036 | -8.542% | 0.000940 | -2.631% | -0.686% | 0.000310 | -4.272% | 0.587% | 2.504% | 4.175% | 0.098368 | 8.257% | 0.296680 | 17.122% | 2.906263 | 63.764% | 0.014594 | 0.000064 | -0.049 | 0.006 | 11.788282 | 0.072648 | 0.008979 | -0.001% | -0.000039 | -72.329% | 0.03421 | 5.4403 | 2.846 | 0.017 | 93.459 | 0.056 |
lmstudio-community/Qwen3-235B-A22B-Q3_K_L | 104 | -27.956% | 0.000083 | -14.266% | 0.002466 | -4.579% | -1.294% | 0.000766 | -7.290% | 0.786% | 3.742% | 6.267% | 0.219563 | 12.470% | 0.628216 | 24.126% | 8.358958 | 77.349% | 0.036266 | 0.000140 | -0.284 | 0.010 | 11.904309 | 0.073302 | 0.023930 | -0.010% | -0.000003 | -99.970% | 0.03584 | 5.6582 | 4.496 | 0.025 | 89.756 | 0.069 |
unsloth/Qwen3-235B-A22B-UD-Q3_K_XL | 97 | -25.243% | 0.000060 | -12.180% | 0.001945 | -3.752% | -0.962% | 0.000612 | -6.159% | 0.874% | 3.649% | 5.976% | 0.180988 | 11.713% | 0.543533 | 22.421% | 5.471307 | 64.130% | 0.029122 | 0.000123 | -0.059 | 0.009 | 11.855173 | 0.073300 | 0.018888 | -0.000% | -0.000004 | -98.693% | 0.03524 | 5.5695 | 4.018 | 0.023 | 90.694 | 0.066 |
Qwen/Qwen3-30B-A3B-BF16 | 56.9 | 15.1443 | 0.10239 | 0.07223 | 9.0703 | |||||||||||||||||||||||||||||
ubergarm/Qwen3-30B-A3B-Q8_0 | 30.3 | -7.050% | 0.000001 | -3.834% | 0.000154 | -1.241% | -0.282% | 0.000038 | -2.035% | 0.231% | 1.176% | 1.964% | 0.013699 | 3.763% | 0.039718 | 7.128% | 0.359152 | 28.466% | 0.002337 | 0.000009 | -0.020 | 0.003 | 15.152095 | 0.102398 | 0.001587 | -0.000% | -0.000047 | -34.379% | 0.07228 | 9.0740 | 1.279 | 0.008 | 96.972 | 0.039 |
ubergarm/Qwen3-30B-A3B-mix-IQ4_K | 17.7 | -11.731% | 0.000004 | -5.522% | 0.000298 | -1.645% | -0.376% | 0.000080 | -2.742% | 0.326% | 1.592% | 2.682% | 0.032109 | 5.373% | 0.104454 | 10.626% | 2.514502 | 39.508% | 0.004821 | 0.000024 | -0.025 | 0.004 | 15.218819 | 0.103071 | 0.002970 | -0.000% | -0.000048 | -44.213% | 0.07278 | 9.1184 | 1.818 | 0.011 | 95.945 | 0.045 |
bartowski/Qwen3-30B-A3B-Q4_K_M | 17.4 | -16.135% | 0.000008 | -8.303% | 0.000652 | -2.643% | -0.645% | 0.000171 | -4.286% | 0.398% | 2.084% | 3.570% | 0.063238 | 7.356% | 0.195169 | 14.392% | 5.985787 | 61.522% | 0.010136 | 0.000053 | -0.158 | 0.006 | 15.194468 | 0.102605 | 0.006434 | -0.001% | -0.000032 | -88.357% | 0.07381 | 9.2092 | 2.619 | 0.018 | 94.329 | 0.053 |
bartowski/Qwen3-30B-A3B-Q4_K_S | 16.8 | -18.122% | 0.000013 | -9.230% | 0.000862 | -3.006% | -0.780% | 0.000235 | -4.787% | 0.402% | 2.215% | 3.866% | 0.077885 | 7.972% | 0.233980 | 15.420% | 5.971601 | 66.795% | 0.012915 | 0.000065 | -0.227 | 0.007 | 15.202408 | 0.102513 | 0.008261 | -0.002% | -0.000038 | -87.019% | 0.07371 | 9.2232 | 2.885 | 0.019 | 93.804 | 0.055 |
unsloth/Qwen3-30B-A3B-UD-Q4_K_XL | 16.5 | -21.984% | 0.000015 | -11.111% | 0.001152 | -3.508% | -0.938% | 0.000315 | -5.582% | 0.421% | 2.460% | 4.261% | 0.102021 | 8.910% | 0.305740 | 17.384% | 5.570370 | 67.990% | 0.016495 | 0.000071 | -0.320 | 0.008 | 15.281833 | 0.103140 | 0.010432 | -0.005% | -0.000016 | -85.356% | 0.07290 | 9.1688 | 3.333 | 0.020 | 93.169 | 0.058 |
ubergarm/Qwen3-30B-A3B-IQ4_KS | 15.5 | -20.721% | 0.000018 | -10.000% | 0.001003 | -3.073% | -0.796% | 0.000292 | -5.017% | 0.442% | 2.398% | 4.167% | 0.094074 | 8.691% | 0.282245 | 16.987% | 6.828948 | 89.561% | 0.014617 | 0.000068 | -0.209 | 0.007 | 15.182811 | 0.102278 | 0.008934 | -0.003% | -0.000031 | -75.475% | 0.07061 | 8.9862 | 3.106 | 0.019 | 93.625 | 0.056 |
ikawrakow/Qwen3-30B-A3B-IQ4_KS-Bartowski | 15.3 | -20.846% | 0.000021 | -10.497% | 0.001098 | -3.434% | -0.905% | 0.000316 | -5.433% | 0.421% | 2.427% | 4.216% | 0.099815 | 8.719% | 0.290617 | 17.546% | 6.971420 | 81.571% | 0.015818 | 0.000074 | -0.288 | 0.007 | 15.150462 | 0.101931 | 0.009988 | -0.004% | -0.000029 | -86.592% | 0.07078 | 9.0016 | 3.244 | 0.020 | 93.317 | 0.057 |
ikawrakow/Qwen3-30B-A3B-IQ4_KS-IK | 15.3 | -21.414% | 0.000026 | -10.689% | 0.001192 | -3.461% | -0.959% | 0.000352 | -5.489% | 0.405% | 2.383% | 4.163% | 0.102473 | 8.750% | 0.301946 | 17.416% | 7.146766 | 58.365% | 0.016277 | 0.000074 | -0.323 | 0.007 | 15.161535 | 0.101972 | 0.010269 | -0.006% | -0.000007 | -90.822% | 0.07094 | 9.0177 | 3.265 | 0.019 | 93.216 | 0.057 |
ikawrakow/Qwen3-30B-A3B-IQ4_KS-Unslolth | 15.3 | -21.919% | 0.000023 | -11.082% | 0.001218 | -3.610% | -1.015% | 0.000351 | -5.698% | 0.396% | 2.355% | 4.173% | 0.104796 | 8.799% | 0.314624 | 18.042% | 7.383745 | 78.742% | 0.016845 | 0.000077 | -0.366 | 0.008 | 15.109454 | 0.101327 | 0.010667 | -0.006% | -0.000012 | -86.065% | 0.06945 | 8.9171 | 3.331 | 0.020 | 93.217 | 0.057 |
unsloth/Qwen3-30B-A3B-UD-IQ2_M | 10.1 | -47.141% | 0.000072 | -22.803% | 0.004283 | -6.698% | -1.739% | 0.001229 | -11.071% | 0.843% | 4.934% | 8.514% | 0.457244 | 17.671% | 1.370219 | 34.262% | 8.153114 | 88.509% | 0.066646 | 0.000267 | -0.607 | 0.015 | 15.889509 | 0.107834 | 0.039668 | -0.011% | -0.000011 | -99.283% | 0.08541 | 10.3726 | 6.627 | 0.033 | 87.029 | 0.077 |
bartowski/Qwen3-30B-A3B-IQ2_M | 9.7 | -48.093% | 0.000068 | -24.583% | 0.005231 | -8.541% | -2.590% | 0.001459 | -13.210% | 0.538% | 4.031% | 7.477% | 0.432021 | 16.466% | 1.262156 | 31.659% | 8.695639 | 80.027% | 0.069100 | 0.000258 | -1.300 | 0.016 | 15.436905 | 0.102661 | 0.044448 | -0.039% | -0.000004 | -96.452% | 0.08036 | 9.9788 | 6.979 | 0.033 | 86.303 | 0.079 |
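If you'd rather not wrangle a spreadsheet, here is a quick sketch for pulling a subset of the table above into pandas for your own graphs (the columns and rows below are a hand-picked illustration, not the full table):

```python
import io
import pandas as pd

# Paste pipe-separated rows from the table above (header + data rows, skipping
# the "---|---" separator line), then sort/plot whichever columns you care about.
raw = """Model|Size (GiB)|Mean KLD|Mean Δp|PPL wiki.test.raw
ubergarm/Qwen3-30B-A3B-Q8_0|30.3|0.002337|-0.020|9.0740
unsloth/Qwen3-30B-A3B-UD-Q4_K_XL|16.5|0.016495|-0.320|9.1688
bartowski/Qwen3-30B-A3B-IQ2_M|9.7|0.069100|-1.300|9.9788
"""

df = pd.read_csv(io.StringIO(raw), sep="|")
print(df.sort_values("Mean KLD"))
```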
👈 Benchmark Suite Raw Data Table
TODO copy/paste it all somewhere if there is enough interest.
👈 llama-sweep-bench Speed Data
PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
---|---|---|---|---|---|---|
512 | 128 | 0 | 0.186 | 2746.40 | 0.912 | 140.37 |
512 | 128 | 512 | 0.189 | 2709.05 | 0.941 | 135.99 |
512 | 128 | 1024 | 0.190 | 2689.73 | 0.940 | 136.22 |
512 | 128 | 1536 | 0.195 | 2631.96 | 0.943 | 135.78 |
512 | 128 | 2048 | 0.197 | 2601.24 | 0.957 | 133.69 |
512 | 128 | 2560 | 0.201 | 2553.51 | 0.959 | 133.43 |
512 | 128 | 3072 | 0.203 | 2526.21 | 0.966 | 132.56 |
512 | 128 | 3584 | 0.207 | 2472.32 | 0.976 | 131.16 |
512 | 128 | 4096 | 0.210 | 2432.41 | 0.986 | 129.80 |
512 | 128 | 4608 | 0.213 | 2406.39 | 0.996 | 128.50 |
512 | 128 | 5120 | 0.215 | 2385.53 | 1.008 | 126.99 |
512 | 128 | 5632 | 0.218 | 2347.09 | 1.018 | 125.72 |
512 | 128 | 6144 | 0.221 | 2321.62 | 1.029 | 124.44 |
512 | 128 | 6656 | 0.224 | 2287.95 | 1.041 | 123.02 |
512 | 128 | 7168 | 0.227 | 2252.04 | 1.053 | 121.57 |
512 | 128 | 7680 | 0.231 | 2218.25 | 1.065 | 120.17 |
512 | 128 | 8192 | 0.233 | 2194.17 | 1.075 | 119.04 |
512 | 128 | 8704 | 0.235 | 2175.86 | 1.086 | 117.92 |
512 | 128 | 9216 | 0.240 | 2133.00 | 1.099 | 116.47 |
512 | 128 | 9728 | 0.241 | 2126.89 | 1.109 | 115.46 |
512 | 128 | 10240 | 0.245 | 2089.25 | 1.120 | 114.25 |
512 | 128 | 10752 | 0.249 | 2055.28 | 1.164 | 109.96 |
512 | 128 | 11264 | 0.252 | 2032.46 | 1.181 | 108.43 |
512 | 128 | 11776 | 0.254 | 2011.96 | 1.171 | 109.29 |
512 | 128 | 12288 | 0.257 | 1993.13 | 1.175 | 108.95 |
512 | 128 | 12800 | 0.260 | 1970.94 | 1.184 | 108.08 |
512 | 128 | 13312 | 0.264 | 1939.95 | 1.186 | 107.95 |
512 | 128 | 13824 | 0.265 | 1930.30 | 1.194 | 107.24 |
512 | 128 | 14336 | 0.270 | 1897.48 | 1.197 | 106.89 |
512 | 128 | 14848 | 0.272 | 1880.96 | 1.204 | 106.32 |
512 | 128 | 15360 | 0.276 | 1856.05 | 1.214 | 105.45 |
512 | 128 | 15872 | 0.279 | 1832.42 | 1.221 | 104.82 |
512 | 128 | 16384 | 0.283 | 1809.73 | 1.229 | 104.13 |
512 | 128 | 16896 | 0.285 | 1796.89 | 1.234 | 103.69 |
512 | 128 | 17408 | 0.288 | 1778.96 | 1.242 | 103.08 |
512 | 128 | 17920 | 0.293 | 1746.74 | 1.249 | 102.52 |
512 | 128 | 18432 | 0.296 | 1729.58 | 1.256 | 101.89 |
512 | 128 | 18944 | 0.298 | 1715.59 | 1.264 | 101.23 |
512 | 128 | 19456 | 0.302 | 1697.53 | 1.269 | 100.87 |
512 | 128 | 19968 | 0.304 | 1684.14 | 1.278 | 100.13 |
512 | 128 | 20480 | 0.307 | 1665.46 | 1.284 | 99.71 |
512 | 128 | 20992 | 0.311 | 1644.88 | 1.291 | 99.12 |
512 | 128 | 21504 | 0.314 | 1631.38 | 1.334 | 95.97 |
512 | 128 | 22016 | 0.317 | 1613.83 | 1.347 | 95.01 |
512 | 128 | 22528 | 0.321 | 1596.46 | 1.339 | 95.57 |
512 | 128 | 23040 | 0.322 | 1589.42 | 1.345 | 95.16 |
512 | 128 | 23552 | 0.325 | 1573.55 | 1.352 | 94.64 |
512 | 128 | 24064 | 0.329 | 1556.41 | 1.358 | 94.25 |
512 | 128 | 24576 | 0.333 | 1537.96 | 1.363 | 93.93 |
512 | 128 | 25088 | 0.335 | 1529.21 | 1.369 | 93.52 |
512 | 128 | 25600 | 0.340 | 1506.80 | 1.378 | 92.91 |
512 | 128 | 26112 | 0.343 | 1494.38 | 1.383 | 92.54 |
512 | 128 | 26624 | 0.347 | 1476.69 | 1.392 | 91.98 |
512 | 128 | 27136 | 0.350 | 1464.63 | 1.398 | 91.53 |
512 | 128 | 27648 | 0.353 | 1451.77 | 1.405 | 91.13 |
512 | 128 | 28160 | 0.355 | 1442.42 | 1.411 | 90.69 |
512 | 128 | 28672 | 0.359 | 1427.94 | 1.418 | 90.26 |
512 | 128 | 29184 | 0.362 | 1415.01 | 1.426 | 89.77 |
512 | 128 | 29696 | 0.364 | 1406.75 | 1.433 | 89.33 |
512 | 128 | 30208 | 0.367 | 1393.57 | 1.441 | 88.84 |
512 | 128 | 30720 | 0.371 | 1379.72 | 1.450 | 88.27 |
512 | 128 | 31232 | 0.374 | 1367.29 | 1.456 | 87.93 |
512 | 128 | 31744 | 0.378 | 1355.16 | 1.464 | 87.43 |
512 | 128 | 32256 | 0.381 | 1343.89 | 1.507 | 84.94 |
PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
---|---|---|---|---|---|---|
512 | 128 | 0 | 0.219 | 2342.04 | 0.940 | 136.14 |
512 | 128 | 512 | 0.221 | 2320.24 | 0.968 | 132.17 |
512 | 128 | 1024 | 0.222 | 2302.08 | 0.968 | 132.25 |
512 | 128 | 1536 | 0.228 | 2245.09 | 0.976 | 131.11 |
512 | 128 | 2048 | 0.230 | 2230.09 | 0.990 | 129.34 |
512 | 128 | 2560 | 0.233 | 2201.35 | 0.998 | 128.21 |
512 | 128 | 3072 | 0.236 | 2168.36 | 1.005 | 127.38 |
512 | 128 | 3584 | 0.240 | 2128.94 | 1.014 | 126.18 |
512 | 128 | 4096 | 0.243 | 2102.88 | 1.025 | 124.82 |
512 | 128 | 4608 | 0.245 | 2093.47 | 1.035 | 123.68 |
512 | 128 | 5120 | 0.248 | 2062.11 | 1.045 | 122.44 |
512 | 128 | 5632 | 0.251 | 2042.84 | 1.057 | 121.12 |
512 | 128 | 6144 | 0.254 | 2016.60 | 1.069 | 119.78 |
512 | 128 | 6656 | 0.256 | 1996.33 | 1.081 | 118.46 |
512 | 128 | 7168 | 0.260 | 1965.62 | 1.090 | 117.42 |
512 | 128 | 7680 | 0.264 | 1939.11 | 1.103 | 116.03 |
512 | 128 | 8192 | 0.267 | 1917.69 | 1.114 | 114.86 |
512 | 128 | 8704 | 0.269 | 1902.68 | 1.123 | 113.97 |
512 | 128 | 9216 | 0.275 | 1864.88 | 1.139 | 112.41 |
512 | 128 | 9728 | 0.275 | 1864.80 | 1.149 | 111.43 |
512 | 128 | 10240 | 0.280 | 1831.10 | 1.173 | 109.12 |
512 | 128 | 10752 | 0.282 | 1813.40 | 1.209 | 105.90 |
512 | 128 | 11264 | 0.286 | 1792.80 | 1.224 | 104.61 |
512 | 128 | 11776 | 0.289 | 1769.64 | 1.217 | 105.19 |
512 | 128 | 12288 | 0.291 | 1756.56 | 1.219 | 104.97 |
512 | 128 | 12800 | 0.296 | 1730.89 | 1.230 | 104.08 |
512 | 128 | 13312 | 0.298 | 1717.56 | 1.231 | 103.94 |
512 | 128 | 13824 | 0.299 | 1709.78 | 1.237 | 103.48 |
512 | 128 | 14336 | 0.304 | 1684.98 | 1.241 | 103.15 |
512 | 128 | 14848 | 0.306 | 1672.32 | 1.247 | 102.63 |
512 | 128 | 15360 | 0.309 | 1657.69 | 1.251 | 102.28 |
512 | 128 | 15872 | 0.312 | 1642.84 | 1.258 | 101.72 |
512 | 128 | 16384 | 0.316 | 1620.66 | 1.265 | 101.16 |
512 | 128 | 16896 | 0.319 | 1603.11 | 1.271 | 100.68 |
512 | 128 | 17408 | 0.322 | 1592.25 | 1.280 | 100.04 |
512 | 128 | 17920 | 0.325 | 1573.98 | 1.286 | 99.52 |
512 | 128 | 18432 | 0.328 | 1560.54 | 1.295 | 98.82 |
512 | 128 | 18944 | 0.331 | 1547.27 | 1.303 | 98.27 |
512 | 128 | 19456 | 0.336 | 1525.32 | 1.308 | 97.87 |
512 | 128 | 19968 | 0.336 | 1523.96 | 1.317 | 97.16 |
512 | 128 | 20480 | 0.339 | 1509.92 | 1.323 | 96.72 |
512 | 128 | 20992 | 0.342 | 1498.56 | 1.328 | 96.36 |
512 | 128 | 21504 | 0.344 | 1487.29 | 1.368 | 93.54 |
512 | 128 | 22016 | 0.348 | 1469.52 | 1.386 | 92.32 |
512 | 128 | 22528 | 0.351 | 1458.22 | 1.377 | 92.95 |
512 | 128 | 23040 | 0.354 | 1447.65 | 1.383 | 92.56 |
512 | 128 | 23552 | 0.357 | 1434.13 | 1.392 | 91.95 |
512 | 128 | 24064 | 0.361 | 1417.81 | 1.397 | 91.60 |
512 | 128 | 24576 | 0.365 | 1401.75 | 1.400 | 91.40 |
512 | 128 | 25088 | 0.367 | 1395.82 | 1.408 | 90.89 |
512 | 128 | 25600 | 0.369 | 1387.75 | 1.412 | 90.67 |
512 | 128 | 26112 | 0.374 | 1368.77 | 1.418 | 90.29 |
512 | 128 | 26624 | 0.377 | 1359.02 | 1.427 | 89.71 |
512 | 128 | 27136 | 0.380 | 1347.28 | 1.434 | 89.25 |
512 | 128 | 27648 | 0.383 | 1336.61 | 1.439 | 88.92 |
512 | 128 | 28160 | 0.387 | 1322.05 | 1.446 | 88.50 |
512 | 128 | 28672 | 0.389 | 1315.73 | 1.454 | 88.02 |
512 | 128 | 29184 | 0.392 | 1307.57 | 1.461 | 87.58 |
512 | 128 | 29696 | 0.395 | 1295.59 | 1.468 | 87.16 |
512 | 128 | 30208 | 0.400 | 1281.33 | 1.475 | 86.77 |
512 | 128 | 30720 | 0.403 | 1269.72 | 1.485 | 86.17 |
512 | 128 | 31232 | 0.406 | 1260.77 | 1.493 | 85.75 |
512 | 128 | 31744 | 0.411 | 1245.97 | 1.499 | 85.37 |
512 | 128 | 32256 | 0.411 | 1244.60 | 1.538 | 83.20 |
PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
---|---|---|---|---|---|---|
512 | 128 | 0 | 0.199 | 2571.39 | 0.929 | 137.72 |
512 | 128 | 512 | 0.200 | 2558.87 | 0.958 | 133.66 |
512 | 128 | 1024 | 0.205 | 2502.88 | 0.958 | 133.60 |
512 | 128 | 1536 | 0.209 | 2449.39 | 0.966 | 132.45 |
512 | 128 | 2048 | 0.211 | 2424.91 | 0.979 | 130.70 |
512 | 128 | 2560 | 0.214 | 2387.42 | 0.981 | 130.51 |
512 | 128 | 3072 | 0.217 | 2359.21 | 0.990 | 129.36 |
512 | 128 | 3584 | 0.220 | 2322.95 | 1.001 | 127.93 |
512 | 128 | 4096 | 0.224 | 2281.51 | 1.011 | 126.63 |
512 | 128 | 4608 | 0.226 | 2264.66 | 1.020 | 125.44 |
512 | 128 | 5120 | 0.228 | 2246.85 | 1.031 | 124.21 |
512 | 128 | 5632 | 0.231 | 2218.24 | 1.040 | 123.07 |
512 | 128 | 6144 | 0.235 | 2177.99 | 1.054 | 121.47 |
512 | 128 | 6656 | 0.237 | 2158.85 | 1.065 | 120.14 |
512 | 128 | 7168 | 0.241 | 2124.91 | 1.078 | 118.72 |
512 | 128 | 7680 | 0.245 | 2088.47 | 1.094 | 116.98 |
512 | 128 | 8192 | 0.248 | 2066.12 | 1.106 | 115.68 |
512 | 128 | 8704 | 0.250 | 2044.39 | 1.117 | 114.54 |
512 | 128 | 9216 | 0.253 | 2023.04 | 1.130 | 113.27 |
512 | 128 | 9728 | 0.256 | 2002.81 | 1.141 | 112.18 |
512 | 128 | 10240 | 0.259 | 1980.01 | 1.154 | 110.94 |
512 | 128 | 10752 | 0.263 | 1945.18 | 1.198 | 106.84 |
512 | 128 | 11264 | 0.265 | 1928.54 | 1.211 | 105.70 |
512 | 128 | 11776 | 0.268 | 1908.01 | 1.204 | 106.28 |
512 | 128 | 12288 | 0.271 | 1891.82 | 1.207 | 106.08 |
512 | 128 | 12800 | 0.275 | 1861.92 | 1.216 | 105.27 |
512 | 128 | 13312 | 0.277 | 1846.15 | 1.219 | 104.99 |
512 | 128 | 13824 | 0.280 | 1829.45 | 1.226 | 104.43 |
512 | 128 | 14336 | 0.283 | 1807.34 | 1.229 | 104.17 |
512 | 128 | 14848 | 0.286 | 1789.55 | 1.233 | 103.77 |
512 | 128 | 15360 | 0.289 | 1774.14 | 1.241 | 103.12 |
512 | 128 | 15872 | 0.293 | 1750.23 | 1.248 | 102.55 |
512 | 128 | 16384 | 0.296 | 1730.68 | 1.256 | 101.88 |
512 | 128 | 16896 | 0.299 | 1713.86 | 1.261 | 101.49 |
512 | 128 | 17408 | 0.301 | 1700.49 | 1.271 | 100.72 |
512 | 128 | 17920 | 0.306 | 1671.47 | 1.281 | 99.93 |
512 | 128 | 18432 | 0.310 | 1652.08 | 1.291 | 99.17 |
512 | 128 | 18944 | 0.313 | 1637.83 | 1.299 | 98.53 |
512 | 128 | 19456 | 0.316 | 1618.98 | 1.302 | 98.32 |
512 | 128 | 19968 | 0.317 | 1612.79 | 1.314 | 97.42 |
512 | 128 | 20480 | 0.321 | 1595.76 | 1.319 | 97.04 |
512 | 128 | 20992 | 0.326 | 1572.01 | 1.327 | 96.43 |
512 | 128 | 21504 | 0.328 | 1561.24 | 1.369 | 93.51 |
512 | 128 | 22016 | 0.332 | 1543.74 | 1.383 | 92.57 |
512 | 128 | 22528 | 0.335 | 1529.05 | 1.373 | 93.23 |
512 | 128 | 23040 | 0.336 | 1524.73 | 1.374 | 93.17 |
512 | 128 | 23552 | 0.337 | 1517.70 | 1.386 | 92.33 |
512 | 128 | 24064 | 0.343 | 1493.95 | 1.387 | 92.27 |
512 | 128 | 24576 | 0.346 | 1481.52 | 1.393 | 91.88 |
512 | 128 | 25088 | 0.349 | 1466.47 | 1.401 | 91.37 |
512 | 128 | 25600 | 0.350 | 1462.59 | 1.406 | 91.06 |
512 | 128 | 26112 | 0.356 | 1438.68 | 1.413 | 90.61 |
512 | 128 | 26624 | 0.359 | 1425.06 | 1.418 | 90.29 |
512 | 128 | 27136 | 0.361 | 1417.08 | 1.426 | 89.75 |
512 | 128 | 27648 | 0.365 | 1403.93 | 1.433 | 89.33 |
512 | 128 | 28160 | 0.368 | 1389.95 | 1.442 | 88.74 |
512 | 128 | 28672 | 0.371 | 1380.36 | 1.454 | 88.02 |
512 | 128 | 29184 | 0.374 | 1369.27 | 1.458 | 87.79 |
512 | 128 | 29696 | 0.378 | 1355.92 | 1.465 | 87.36 |
512 | 128 | 30208 | 0.381 | 1345.24 | 1.471 | 87.01 |
512 | 128 | 30720 | 0.383 | 1336.71 | 1.482 | 86.39 |
512 | 128 | 31232 | 0.387 | 1324.60 | 1.486 | 86.11 |
512 | 128 | 31744 | 0.390 | 1311.28 | 1.494 | 85.65 |
512 | 128 | 32256 | 0.393 | 1302.29 | 1.535 | 83.40 |
PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
---|---|---|---|---|---|---|
512 | 128 | 0 | 0.185 | 2771.13 | 0.895 | 143.07 |
512 | 128 | 512 | 0.187 | 2735.63 | 0.923 | 138.71 |
512 | 128 | 1024 | 0.190 | 2699.01 | 0.921 | 138.95 |
512 | 128 | 1536 | 0.195 | 2627.30 | 0.930 | 137.64 |
512 | 128 | 2048 | 0.196 | 2614.49 | 0.943 | 135.73 |
512 | 128 | 2560 | 0.200 | 2560.59 | 0.947 | 135.10 |
512 | 128 | 3072 | 0.202 | 2528.42 | 0.954 | 134.19 |
512 | 128 | 3584 | 0.206 | 2481.69 | 0.964 | 132.77 |
512 | 128 | 4096 | 0.210 | 2443.23 | 0.974 | 131.47 |
512 | 128 | 4608 | 0.212 | 2413.67 | 0.985 | 129.96 |
512 | 128 | 5120 | 0.214 | 2394.67 | 0.995 | 128.61 |
512 | 128 | 5632 | 0.219 | 2340.45 | 1.015 | 126.14 |
512 | 128 | 6144 | 0.222 | 2306.96 | 1.024 | 125.01 |
512 | 128 | 6656 | 0.225 | 2273.36 | 1.035 | 123.64 |
512 | 128 | 7168 | 0.228 | 2242.54 | 1.050 | 121.92 |
512 | 128 | 7680 | 0.231 | 2212.63 | 1.060 | 120.71 |
512 | 128 | 8192 | 0.235 | 2182.09 | 1.068 | 119.82 |
512 | 128 | 8704 | 0.237 | 2157.82 | 1.082 | 118.25 |
512 | 128 | 9216 | 0.241 | 2123.14 | 1.097 | 116.72 |
512 | 128 | 9728 | 0.243 | 2109.32 | 1.104 | 115.90 |
512 | 128 | 10240 | 0.246 | 2077.16 | 1.119 | 114.35 |
512 | 128 | 10752 | 0.250 | 2049.47 | 1.168 | 109.62 |
512 | 128 | 11264 | 0.254 | 2017.75 | 1.183 | 108.21 |
512 | 128 | 11776 | 0.255 | 2009.66 | 1.173 | 109.13 |
512 | 128 | 12288 | 0.259 | 1976.27 | 1.176 | 108.86 |
512 | 128 | 12800 | 0.261 | 1957.95 | 1.186 | 107.97 |
512 | 128 | 13312 | 0.266 | 1926.83 | 1.187 | 107.84 |
512 | 128 | 13824 | 0.267 | 1914.87 | 1.191 | 107.45 |
512 | 128 | 14336 | 0.271 | 1888.06 | 1.196 | 107.00 |
512 | 128 | 14848 | 0.274 | 1869.73 | 1.202 | 106.49 |
512 | 128 | 15360 | 0.277 | 1849.09 | 1.209 | 105.84 |
512 | 128 | 15872 | 0.280 | 1828.40 | 1.215 | 105.35 |
512 | 128 | 16384 | 0.284 | 1801.44 | 1.224 | 104.57 |
512 | 128 | 16896 | 0.287 | 1781.87 | 1.229 | 104.13 |
512 | 128 | 17408 | 0.290 | 1767.18 | 1.239 | 103.35 |
512 | 128 | 17920 | 0.293 | 1747.06 | 1.245 | 102.83 |
512 | 128 | 18432 | 0.296 | 1731.39 | 1.252 | 102.25 |
512 | 128 | 18944 | 0.299 | 1712.43 | 1.259 | 101.64 |
512 | 128 | 19456 | 0.303 | 1690.65 | 1.265 | 101.17 |
512 | 128 | 19968 | 0.304 | 1682.41 | 1.276 | 100.31 |
512 | 128 | 20480 | 0.308 | 1660.25 | 1.280 | 99.99 |
512 | 128 | 20992 | 0.312 | 1641.94 | 1.285 | 99.57 |
512 | 128 | 21504 | 0.314 | 1628.35 | 1.331 | 96.17 |
512 | 128 | 22016 | 0.318 | 1611.79 | 1.346 | 95.11 |
512 | 128 | 22528 | 0.321 | 1596.28 | 1.337 | 95.72 |
512 | 128 | 23040 | 0.324 | 1580.92 | 1.340 | 95.54 |
512 | 128 | 23552 | 0.325 | 1573.30 | 1.351 | 94.74 |
512 | 128 | 24064 | 0.330 | 1552.94 | 1.350 | 94.81 |
512 | 128 | 24576 | 0.334 | 1534.84 | 1.355 | 94.48 |
512 | 128 | 25088 | 0.335 | 1526.93 | 1.361 | 94.06 |
512 | 128 | 25600 | 0.339 | 1511.89 | 1.366 | 93.70 |
512 | 128 | 26112 | 0.343 | 1492.70 | 1.383 | 92.55 |
512 | 128 | 26624 | 0.347 | 1476.86 | 1.387 | 92.27 |
512 | 128 | 27136 | 0.350 | 1462.35 | 1.397 | 91.63 |
512 | 128 | 27648 | 0.354 | 1446.91 | 1.404 | 91.16 |
512 | 128 | 28160 | 0.356 | 1438.02 | 1.412 | 90.66 |
512 | 128 | 28672 | 0.361 | 1419.66 | 1.418 | 90.26 |
512 | 128 | 29184 | 0.362 | 1413.92 | 1.426 | 89.77 |
512 | 128 | 29696 | 0.365 | 1401.20 | 1.433 | 89.32 |
512 | 128 | 30208 | 0.368 | 1391.23 | 1.439 | 88.97 |
512 | 128 | 30720 | 0.372 | 1377.54 | 1.450 | 88.29 |
512 | 128 | 31232 | 0.374 | 1369.93 | 1.453 | 88.09 |
512 | 128 | 31744 | 0.378 | 1356.09 | 1.462 | 87.56 |
512 | 128 | 32256 | 0.380 | 1347.04 | 1.503 | 85.14 |
PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
---|---|---|---|---|---|---|
512 | 128 | 0 | 0.211 | 2423.89 | 0.943 | 135.74 |
512 | 128 | 512 | 0.213 | 2399.31 | 0.971 | 131.85 |
512 | 128 | 1024 | 0.216 | 2374.33 | 0.969 | 132.11 |
512 | 128 | 1536 | 0.219 | 2340.30 | 0.979 | 130.80 |
512 | 128 | 2048 | 0.220 | 2325.55 | 0.991 | 129.11 |
512 | 128 | 2560 | 0.225 | 2276.44 | 0.994 | 128.82 |
512 | 128 | 3072 | 0.228 | 2247.84 | 1.001 | 127.89 |
512 | 128 | 3584 | 0.232 | 2207.44 | 1.011 | 126.60 |
512 | 128 | 4096 | 0.236 | 2170.20 | 1.023 | 125.06 |
512 | 128 | 4608 | 0.236 | 2166.89 | 1.032 | 124.00 |
512 | 128 | 5120 | 0.240 | 2131.06 | 1.044 | 122.58 |
512 | 128 | 5632 | 0.244 | 2102.12 | 1.054 | 121.41 |
512 | 128 | 6144 | 0.247 | 2076.33 | 1.063 | 120.43 |
512 | 128 | 6656 | 0.249 | 2055.14 | 1.077 | 118.82 |
512 | 128 | 7168 | 0.253 | 2024.47 | 1.088 | 117.68 |
512 | 128 | 7680 | 0.256 | 1996.90 | 1.099 | 116.45 |
512 | 128 | 8192 | 0.260 | 1967.17 | 1.114 | 114.93 |
512 | 128 | 8704 | 0.260 | 1967.20 | 1.122 | 114.06 |
512 | 128 | 9216 | 0.266 | 1922.64 | 1.135 | 112.81 |
512 | 128 | 9728 | 0.268 | 1911.09 | 1.147 | 111.63 |
512 | 128 | 10240 | 0.272 | 1885.44 | 1.157 | 110.64 |
512 | 128 | 10752 | 0.274 | 1865.36 | 1.202 | 106.45 |
512 | 128 | 11264 | 0.278 | 1844.60 | 1.217 | 105.18 |
512 | 128 | 11776 | 0.279 | 1836.43 | 1.208 | 105.93 |
512 | 128 | 12288 | 0.283 | 1810.13 | 1.213 | 105.57 |
512 | 128 | 12800 | 0.288 | 1780.11 | 1.229 | 104.16 |
512 | 128 | 13312 | 0.291 | 1758.14 | 1.229 | 104.12 |
512 | 128 | 13824 | 0.292 | 1753.98 | 1.238 | 103.39 |
512 | 128 | 14336 | 0.298 | 1718.12 | 1.241 | 103.10 |
512 | 128 | 14848 | 0.300 | 1706.26 | 1.247 | 102.61 |
512 | 128 | 15360 | 0.302 | 1693.28 | 1.254 | 102.07 |
512 | 128 | 15872 | 0.306 | 1673.01 | 1.262 | 101.46 |
512 | 128 | 16384 | 0.310 | 1650.90 | 1.268 | 100.96 |
512 | 128 | 16896 | 0.313 | 1638.03 | 1.275 | 100.41 |
512 | 128 | 17408 | 0.315 | 1625.29 | 1.281 | 99.90 |
512 | 128 | 17920 | 0.318 | 1609.23 | 1.289 | 99.31 |
512 | 128 | 18432 | 0.322 | 1589.10 | 1.297 | 98.68 |
512 | 128 | 18944 | 0.325 | 1575.42 | 1.302 | 98.29 |
512 | 128 | 19456 | 0.330 | 1553.28 | 1.310 | 97.73 |
512 | 128 | 19968 | 0.330 | 1552.98 | 1.319 | 97.05 |
512 | 128 | 20480 | 0.334 | 1531.58 | 1.324 | 96.67 |
512 | 128 | 20992 | 0.337 | 1518.07 | 1.332 | 96.12 |
512 | 128 | 21504 | 0.340 | 1507.15 | 1.373 | 93.25 |
512 | 128 | 22016 | 0.344 | 1488.06 | 1.385 | 92.41 |
512 | 128 | 22528 | 0.347 | 1477.13 | 1.378 | 92.88 |
512 | 128 | 23040 | 0.349 | 1467.54 | 1.384 | 92.47 |
512 | 128 | 23552 | 0.351 | 1459.50 | 1.394 | 91.80 |
512 | 128 | 24064 | 0.356 | 1440.13 | 1.397 | 91.61 |
512 | 128 | 24576 | 0.359 | 1426.95 | 1.401 | 91.36 |
512 | 128 | 25088 | 0.360 | 1423.59 | 1.409 | 90.82 |
512 | 128 | 25600 | 0.364 | 1405.52 | 1.413 | 90.62 |
512 | 128 | 26112 | 0.369 | 1388.93 | 1.419 | 90.18 |
512 | 128 | 26624 | 0.371 | 1379.47 | 1.426 | 89.79 |
512 | 128 | 27136 | 0.374 | 1369.38 | 1.434 | 89.28 |
512 | 128 | 27648 | 0.377 | 1357.58 | 1.441 | 88.85 |
512 | 128 | 28160 | 0.382 | 1342.07 | 1.447 | 88.44 |
512 | 128 | 28672 | 0.384 | 1333.90 | 1.455 | 87.99 |
512 | 128 | 29184 | 0.386 | 1326.66 | 1.461 | 87.62 |
512 | 128 | 29696 | 0.390 | 1313.92 | 1.468 | 87.22 |
512 | 128 | 30208 | 0.394 | 1298.28 | 1.483 | 86.34 |
512 | 128 | 30720 | 0.398 | 1286.81 | 1.488 | 86.02 |
512 | 128 | 31232 | 0.400 | 1280.36 | 1.494 | 85.70 |
512 | 128 | 31744 | 0.405 | 1263.20 | 1.502 | 85.21 |
512 | 128 | 32256 | 0.407 | 1257.02 | 1.545 | 82.83 |
PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
---|---|---|---|---|---|---|
512 | 128 | 0 | 0.198 | 2588.92 | 0.982 | 130.29 |
512 | 128 | 512 | 0.199 | 2574.42 | 1.008 | 127.04 |
512 | 128 | 1024 | 0.203 | 2527.70 | 1.007 | 127.07 |
512 | 128 | 1536 | 0.206 | 2488.20 | 1.017 | 125.90 |
512 | 128 | 2048 | 0.207 | 2468.48 | 1.031 | 124.15 |
512 | 128 | 2560 | 0.211 | 2427.66 | 1.037 | 123.42 |
512 | 128 | 3072 | 0.215 | 2376.22 | 1.045 | 122.45 |
512 | 128 | 3584 | 0.218 | 2344.71 | 1.055 | 121.32 |
512 | 128 | 4096 | 0.220 | 2323.83 | 1.066 | 120.13 |
512 | 128 | 4608 | 0.224 | 2286.56 | 1.075 | 119.08 |
512 | 128 | 5120 | 0.226 | 2263.56 | 1.086 | 117.87 |
512 | 128 | 5632 | 0.229 | 2233.20 | 1.097 | 116.64 |
512 | 128 | 6144 | 0.231 | 2216.06 | 1.108 | 115.56 |
512 | 128 | 6656 | 0.235 | 2174.18 | 1.125 | 113.82 |
512 | 128 | 7168 | 0.239 | 2141.53 | 1.137 | 112.61 |
512 | 128 | 7680 | 0.244 | 2099.48 | 1.148 | 111.48 |
512 | 128 | 8192 | 0.245 | 2087.77 | 1.160 | 110.38 |
512 | 128 | 8704 | 0.247 | 2076.19 | 1.170 | 109.38 |
512 | 128 | 9216 | 0.251 | 2040.21 | 1.183 | 108.22 |
512 | 128 | 9728 | 0.252 | 2028.41 | 1.192 | 107.39 |
512 | 128 | 10240 | 0.255 | 2006.18 | 1.204 | 106.35 |
512 | 128 | 10752 | 0.257 | 1988.99 | 1.247 | 102.62 |
512 | 128 | 11264 | 0.261 | 1963.06 | 1.264 | 101.28 |
512 | 128 | 11776 | 0.262 | 1951.61 | 1.257 | 101.84 |
512 | 128 | 12288 | 0.265 | 1932.03 | 1.260 | 101.62 |
512 | 128 | 12800 | 0.269 | 1901.02 | 1.269 | 100.83 |
512 | 128 | 13312 | 0.272 | 1882.44 | 1.271 | 100.72 |
512 | 128 | 13824 | 0.273 | 1873.24 | 1.274 | 100.48 |
512 | 128 | 14336 | 0.277 | 1845.12 | 1.281 | 99.91 |
512 | 128 | 14848 | 0.280 | 1830.87 | 1.290 | 99.24 |
512 | 128 | 15360 | 0.282 | 1812.46 | 1.296 | 98.79 |
512 | 128 | 15872 | 0.286 | 1793.02 | 1.302 | 98.31 |
512 | 128 | 16384 | 0.288 | 1778.72 | 1.309 | 97.80 |
512 | 128 | 16896 | 0.293 | 1745.22 | 1.316 | 97.26 |
512 | 128 | 17408 | 0.295 | 1732.67 | 1.323 | 96.76 |
512 | 128 | 17920 | 0.299 | 1714.14 | 1.331 | 96.19 |
512 | 128 | 18432 | 0.301 | 1698.58 | 1.337 | 95.74 |
512 | 128 | 18944 | 0.306 | 1675.72 | 1.350 | 94.84 |
512 | 128 | 19456 | 0.307 | 1668.01 | 1.349 | 94.88 |
512 | 128 | 19968 | 0.313 | 1636.65 | 1.360 | 94.11 |
512 | 128 | 20480 | 0.314 | 1632.97 | 1.366 | 93.72 |
512 | 128 | 20992 | 0.316 | 1620.05 | 1.374 | 93.17 |
512 | 128 | 21504 | 0.319 | 1606.86 | 1.411 | 90.70 |
512 | 128 | 22016 | 0.322 | 1590.15 | 1.426 | 89.75 |
512 | 128 | 22528 | 0.327 | 1567.20 | 1.422 | 90.01 |
512 | 128 | 23040 | 0.330 | 1553.12 | 1.425 | 89.83 |
512 | 128 | 23552 | 0.333 | 1536.30 | 1.434 | 89.28 |
512 | 128 | 24064 | 0.337 | 1520.89 | 1.434 | 89.24 |
512 | 128 | 24576 | 0.339 | 1508.19 | 1.440 | 88.87 |
512 | 128 | 25088 | 0.343 | 1492.82 | 1.446 | 88.52 |
512 | 128 | 25600 | 0.344 | 1487.87 | 1.451 | 88.21 |
512 | 128 | 26112 | 0.350 | 1461.28 | 1.459 | 87.74 |
512 | 128 | 26624 | 0.350 | 1463.85 | 1.466 | 87.32 |
512 | 128 | 27136 | 0.354 | 1445.83 | 1.474 | 86.86 |
512 | 128 | 27648 | 0.357 | 1432.50 | 1.485 | 86.20 |
512 | 128 | 28160 | 0.363 | 1410.51 | 1.487 | 86.10 |
512 | 128 | 28672 | 0.365 | 1402.82 | 1.493 | 85.72 |
512 | 128 | 29184 | 0.368 | 1389.55 | 1.502 | 85.22 |
512 | 128 | 29696 | 0.371 | 1379.92 | 1.508 | 84.87 |
512 | 128 | 30208 | 0.374 | 1367.99 | 1.514 | 84.55 |
512 | 128 | 30720 | 0.377 | 1359.40 | 1.524 | 84.00 |
512 | 128 | 31232 | 0.378 | 1353.18 | 1.529 | 83.72 |
512 | 128 | 31744 | 0.382 | 1338.84 | 1.538 | 83.22 |
512 | 128 | 32256 | 0.386 | 1327.16 | 1.578 | 81.10 |
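And a minimal sketch (plain Python, matplotlib assumed) for turning any one of the llama-sweep-bench tables above into a throughput-vs-context graph:

```python
import matplotlib.pyplot as plt

def parse_sweep(text):
    """Parse rows like '512 | 128 | 4096 | 0.210 | 2432.41 | 0.986 | 129.80 |'
    from a llama-sweep-bench table into (N_KV, S_PP, S_TG) tuples."""
    rows = []
    for line in text.splitlines():
        cols = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cols) == 7 and cols[0].isdigit():
            pp, tg, n_kv, t_pp, s_pp, t_tg, s_tg = cols
            rows.append((int(n_kv), float(s_pp), float(s_tg)))
    return rows

# A few rows copied from the first table above, just for illustration.
table = """
512 | 128 | 0 | 0.186 | 2746.40 | 0.912 | 140.37 |
512 | 128 | 16384 | 0.283 | 1809.73 | 1.229 | 104.13 |
512 | 128 | 32256 | 0.381 | 1343.89 | 1.507 | 84.94 |
"""

data = parse_sweep(table)
n_kv = [r[0] for r in data]
plt.plot(n_kv, [r[2] for r in data], marker="o", label="S_TG t/s")
plt.plot(n_kv, [r[1] for r in data], marker="s", label="S_PP t/s")
plt.xlabel("N_KV (tokens in cache)")
plt.ylabel("throughput (tokens/s)")
plt.legend()
plt.savefig("sweep.png")
```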
👈 PPL, KLD, Δp Statistics
In general these attempt to systematically measure the difference between an unquantized model and a given quantized version; lower is better, as it signals the quantized version performs more similarly to the original. Quantization is the process of compressing an original model's weights to shrink it down to run on limited hardware. Ideally the process minimizes errors and preserves the original uncompressed model's performance.
Perplexity (PPL) is a metric used to evaluate how well a language model predicts text. It essentially measures how "surprised" the model is by a given text—if the model is good at predicting the next word, the perplexity is low. For example, a model that generates coherent, contextually accurate text will have lower perplexity than one that produces random or nonsensical output.
In the context of LLM quantization (e.g., reducing model precision to save resources), perplexity is used to check if the compressed model retains its language understanding. Generally the PPL of the unquantized model is expected to be lower than the PPL of a quantized version.
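As a rough illustration (my own minimal sketch, not how llama-perplexity is implemented), perplexity over a token sequence is just the exponential of the average negative log-probability the model assigns to the true next tokens:

```python
import math

def perplexity(token_probs):
    """PPL = exp(-(1/N) * sum(log p_i)) over the probabilities the model
    assigned to each actual next token in the evaluation text."""
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)

# A confident model (high probabilities on the true tokens) scores a low PPL.
print(perplexity([0.5, 0.4, 0.6]))   # ~2.0
print(perplexity([0.05, 0.1, 0.02])) # ~21.5
```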
However, in quantization-aware training (QAT), the model is trained to handle lower-precision weights (e.g., from bf16 to int4) during training, simulating the effects of quantization. This helps the model adapt to the reduced precision, potentially maintaining performance even after quantization.
The PPL of the unquantized bf16 model might not always be lower because the quantized model (trained with QAT) might retain performance close to the original bf16 model. If QAT is effective, the quantized model's PPL could be similar to, or even lower than, the original's, meaning the unquantized model's PPL isn't necessarily lower.
KL-Divergence (KLD) is a statistical measure that quantifies how different two probability distributions are. In the context of Large Language Models (LLMs) and quantization, it’s used to compare how a compressed (quantized) model differs from the original (unquantized) model in terms of their output probabilities.
If two models produce nearly identical predictions (e.g., the same probabilities for words in a sentence), their KLD is low. If their predictions diverge significantly (e.g., the quantized model chooses different words more often), the KLD is high.
Typically a very large KLD baseline data file is generated on the original (or least quantized) version of the model. This baseline is then compared against quantized versions to measure KLD as well as Δp.
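A minimal sketch of the per-token KL-divergence calculation (my own illustration, not ik_llama.cpp's code), comparing a baseline next-token distribution against a quantized model's:

```python
import math

def kl_divergence(p_base, q_quant):
    """KLD(P || Q) = sum_i p_i * log(p_i / q_i), where P is the baseline
    (e.g. BF16) next-token distribution and Q is the quantized model's."""
    return sum(p * math.log(p / q) for p, q in zip(p_base, q_quant) if p > 0)

base  = [0.70, 0.20, 0.10]   # baseline model's probabilities over three tokens
close = [0.68, 0.22, 0.10]   # quantized model, nearly identical -> tiny KLD
far   = [0.30, 0.50, 0.20]   # quantized model, very different   -> large KLD

print(kl_divergence(base, close))  # ~0.0012
print(kl_divergence(base, far))    # ~0.34
```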
Δp Token Probability Distribution Difference refers to the difference in token probability distributions between an unquantized (full-precision) model and a quantized model. It measures how much the probabilities assigned to individual tokens (e.g., words or subwords) change after quantization.
For example, for each token in a given input sequence, the unquantized model computes a probability distribution over the vocabulary (e.g., "the" has 10% chance, "cat" has 5%, etc.). The quantized model (e.g., IQ4_K or Q2_K_L) computes a similar distribution, but due to precision loss, the probabilities may shift. Δp is the absolute or relative difference between these two distributions for each token. A specific example: if the unquantized model assigns 0.2 to "cat" and the quantized model assigns 0.15, the Δp for "cat" is 0.05.
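In the tables above, the Δp statistics summarize these per-token shifts, and the "Same top p" column tracks how often the two models still pick the same most-likely token. A minimal sketch of that idea (hypothetical toy distributions and my own naming, not the llama.cpp code):

```python
def delta_p(p_base, p_quant):
    """Per-token probability shift between baseline and quantized models.
    Positive means the quantized model became *more* confident in that token."""
    return {tok: p_quant.get(tok, 0.0) - p for tok, p in p_base.items()}

# Hypothetical next-token distributions (only a few vocabulary entries shown).
base  = {"cat": 0.20, "dog": 0.10, "the": 0.10}
quant = {"cat": 0.15, "dog": 0.12, "the": 0.10}

print(delta_p(base, quant))  # {'cat': -0.05, 'dog': 0.02, 'the': 0.0}
print(max(base, key=base.get) == max(quant, key=quant.get))  # same top token? True
```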
👈 Benchmark Suites
GPQA Diamond Set: A subset of 198 high-objectivity, challenging multiple-choice questions designed for advanced testing. Difficulty aligns with college-level or higher expertise in biology, physics, and chemistry. Intended for evaluating AI systems' ability to handle complex, domain-specific tasks requiring deep knowledge and critical thinking.
MBPP Mostly Basic Programming Problems is a benchmark dataset designed to evaluate large language models (LLMs) on programming tasks focusing on Python code. The benchmark consists of around 1,000 crowd-sourced Python programming problems, designed to be solvable by entry level programmers, covering programming fundamentals, standard library functionality, and so on. Each problem consists of a task description, code solution and 3 automated test cases.
MMLU-Pro is an enhanced benchmark designed to evaluate language understanding models across broader and more challenging tasks. Building on the Massive Multitask Language Understanding (MMLU) dataset, MMLU-Pro integrates more challenging, reasoning-focused questions and increases the answer choices per question from four to ten, significantly raising the difficulty and reducing the chance of success through random guessing. MMLU-Pro comprises over 12,000 rigorously curated questions from academic exams and textbooks, spanning 14 diverse domains including Biology, Business, Chemistry, Computer Science, Economics, Engineering, Health, History, Law, Math, Philosophy, Physics, Psychology, and Others.
MT-Bench is a benchmark designed to evaluate the multi-turn conversational abilities and instruction-following skills of large language models (LLMs). Unlike traditional benchmarks that focus on closed-ended tasks (e.g., multiple-choice questions), MT-Bench emphasizes open-ended, real-world interactions to measure how well models handle complex, dynamic dialogues. Its authors conducted a detailed analysis of real multi-turn dialogue data and constructed a three-tier hierarchical ability taxonomy comprising 4208 turns across 1388 multi-turn dialogues in 13 distinct tasks.
MixEval is a ground-truth-based dynamic benchmark derived from off-the-shelf benchmark mixtures, which evaluates LLMs with a highly capable model ranking (i.e., 0.96 correlation with Chatbot Arena) while running locally and quickly (6% the time and cost of running MMLU), with its queries being stably and effortlessly updated every month to avoid contamination.
- ik_llama.cpp Qwen3 Quants Discussion
- calibration_data_v5_rc.txt - ubergarm uses tristandruyen's imatrix calibration dataset
- wiki.test.raw.gz
- ubergarm-kld-test-corpus.txt - Private gist available upon request if you don't use it for training or fine-tuning or imatrix calibration etc.
- visualization of Qwen3-30B-A3B imatrix statistics