@ubergarm
Last active May 17, 2025 10:52
Qwen3 235B and 30B MoE Quant Benchmarking Roundup

The Great Quant Wars of 2025

"All things leave behind them the Obscurity... and go forward to embrace the Brightness..." — Dao De Jing #42

tl;dr;

  • Q: Who provides the best GGUFs now?
  • A: They're all pretty good.

Skip down if you just want graphs and numbers comparing various Qwen3-30B-A3B GGUF quants.

Background

It's been well over a year since TheBloke uploaded his last quant to huggingface. The LLM landscape has changed markedly since then with many new models being released monthly, new inference engines targeting specific hardware optimizations, and ongoing evolution of quantization algorithms. Our community continues to grow and diversify at an amazing rate.

Fortunately, many folks and organizations have kindly stepped up to keep the quants cooking so we can all find an LLM sized just right to fit on our home rigs. Amongst them, bartowski and unsloth (Daniel and Michael's start-up company) have become the new "household names" for providing a variety of GGUF quantizations for popular model releases and even all those wild creative fine-tunes! (There are many more, including team mradermacher, and too many to list everyone, sorry!)

Until recently, most GGUF-style quant recipes were "static", meaning all the tensors and layers were quantized the same way (e.g. Q8_0) or with consistent patterns defined in llama.cpp's code. So all quants of a given size were mostly the same regardless of who cooked and uploaded them to huggingface.
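
For context, a "static" quant in that sense is just the direct output of llama.cpp's quantize tool given a type name, with the per-tensor choices coming from the built-in rules for that type. A minimal sketch (paths are placeholders, not the actual files used here):

# the built-in Q4_K_M recipe decides the bit-width of each tensor
./build/bin/llama-quantize \
    /models/Qwen3-30B-A3B-BF16.gguf \
    /models/Qwen3-30B-A3B-Q4_K_M.gguf \
    Q4_K_M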

Things began to change over a year ago with major advancements like importance matrix quantizations by ikawrakow in llama.cpp PR#4861 as well as new quant types (like the perennial favorite IQ4_XS) which have become the mainstay for users of llama.cpp, ollama, koboldcpp, lmstudio, etc. The entire GGUF ecosystem owes a big thanks not just to ggerganov but also to ikawrakow (as well as the many other contributors).
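
Roughly speaking, an imatrix quant adds one extra step: collect activation statistics on a calibration text with the full-precision model, then hand them to the quantizer so it can weight the rounding error per tensor. A sketch with placeholder paths (the author's actual imatrix command is in the Methodology section below):

# 1) gather the importance matrix from a calibration text
./build/bin/llama-imatrix \
    -m /models/Qwen3-30B-A3B-BF16.gguf \
    -f calibration.txt \
    -o imatrix.dat

# 2) quantize using those statistics
./build/bin/llama-quantize \
    --imatrix imatrix.dat \
    /models/Qwen3-30B-A3B-BF16.gguf \
    /models/Qwen3-30B-A3B-IQ4_XS.gguf \
    IQ4_XS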

Very recently unsloth introduced a few changes to their quantization methodology that combine different imatrix calibration texts and context lengths, along with making some tensors/layers different sizes than the regular llama.cpp code would (they had a public fork with their branch, but have to update and re-push it due to upstream changes). They have named this change in standard methodology Unsloth Dynamic 2.0 GGUFs as part of their start-up company's marketing strategy.

Around the same time, bartowski has been experimenting with different imatrix calibration texts and opened a PR against llama.cpp to modify the default tensor/layer quantization recipes. I myself began experimenting with custom "dynamic" quantization recipes using ikawrakow's latest SOTA quants like iq4_k, which to date only work on his ik_llama.cpp fork.

While this is great news for all GGUF enjoyers, the friendly competition and additional options have led to some confusion and I dare say some "tribalism". (If part of your identity as a person depends on downloading quants from only one source, I suggest you google: "Nan Yar?").

So how can you, dear reader, decide which is the best quant of a given model for you to download? unsloth already did a great blog post discussing their own benchmarks and metrics. Open a tab to check out u/AaronFeng47's many other benchmarks. And finally, this post contains even more metrics and benchmarks. The best answer I have is "Nullius in verba" (Latin for "take nobody's word for it") — even my word!

Unfortunately, this means there is no one-size-fits-all rule, "X" is not always better than "Y", and if you want to min-max-optimize your LLM for your specific use case on your specific hardware you will probably have to experiment and think critically. If you don't care too much, then pick any of the biggest quants that fit on your rig for the desired context length and you'll be fine because: they're all pretty good.

And with that, let's dive into the Qwen3-30B-A3B benchmarks below!

Quick Thanks

Shout out to Wendell and the Level1Techs crew, the L1T Forums, and the L1T YouTube Channel! BIG thanks for providing BIG hardware expertise and access to run these experiments and make great quants available to the community!!!

Graphs

👈 Qwen3-30B-A3B Benchmark Suite Graphs

Note <think> mode was disabled for these tests to speed up benchmarking.

evalchemy-grouped-by-task

evalchemy-grouped-by-model

👈 Qwen3-30B-A3B Perplexity and KLD Graphs

Using the BF16 as the baseline for KLD stats. Also note the perplexity was lowest ("best") for models other than the bf16, which is not typically the case unless there was possibly some QAT going on. As such, the chart is relative to the lowest perplexity score: PPL/min(PPL)-1, plus a small eps for scaling.
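
In other words, each quant is plotted as its relative distance from the best perplexity in the set:

$$\mathrm{PPL}_{rel}(Q) = \frac{\mathrm{PPL}(Q)}{\min_{q}\mathrm{PPL}(q)} - 1 + \varepsilon$$

where ε is just a small constant so the best-scoring model still shows up on the chart.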

Perplexity

wiki.test.raw (lower is "better")

results-30B-ppl-wiki

ubergarm-kdl-test-corpus.txt (lower is "better")

results-30B-ppl-ubergarm

KLD Stats

(lower is "better")

results-30B-kld

Δp Stats

(lower is "better")

results-30B-deltap

👈 Qwen3-235B-A22B Perplexity and KLD Graphs

Not as many data points here but just for comparison. Keep in mind the Q8_0 was the baseline for KLD stats given I couldn't easily run the full BF16.

Perplexity

wiki.test.raw (lower is "better")

results-235B-ppl-wiki

ubergarm-kdl-test-corpus.txt (lower is "better")

results-235B-ppl-ubergarm

KLD Stats

(lower is "better")

results-235B-kld

Δp Stats

(lower is "better")

results-235B-deltap

👈 Qwen3-30B-A3B Speed llama-sweep-bench Graphs

llama-sweep-bench

llama.cpp

Qwen3-30B-A3B-mainline-gguf-roundup

ik_llama.cpp

NOTE: Keep in mind ik's fork is faster than mainline llama.cpp for many architectures and configurations, especially CPU-only, hybrid CPU+GPU, and DeepSeek MLA cases.

qwen3-30b-ik-sweep

Methodology

👈 Perplexity, KLD, and imatrix Methodology

PPL and KLD testing done with ik_llama.cpp@9ba36270.

Perplexity

I adjust ngl and threads for larger 235B models.

CUDA_VISIBLE_DEVICES="0" \
./build/bin/llama-perplexity \
    -m "$model" \
    --ctx-size 512 \
    --ubatch-size 512 \
    -f wiki.test.raw \
    -fa \
    -ngl 99 \
    --seed 1337 \
    --threads 1

KLD

I adjust ngl and threads for larger 235B models. For 235B I had to use the Q8_0 as the baseline given this rig can't easily run the full 400+GiB BF16.

CUDA_VISIBLE_DEVICES="0" \
./build/bin/llama-perplexity \
    -m "$model" \
    --kl-divergence-base /mnt/raid/models/ubergarm/Qwen3-30B-A3B-GGUF/Qwen3-30B-A3B-BF16-ubergarm-kld-test-corpus-base.dat \
    --kl-divergence \
    -f ubergarm-kld-test-corpus.txt \
    -fa \
    -ngl 99 \
    --seed 1337 \
    --threads 1

imatrix

This is how I make my imatrix using ik_llama.cpp, which additionally prints out cosine similarity data to inform possible custom quant strategies. I haven't seen exactly how unsloth makes their new recipe.

CUDA_VISIBLE_DEVICES="0" \
./build/bin/llama-imatrix \
    --verbosity 1 \
    --layer-similarity \
    -m /mnt/raid/models/Qwen/Qwen3-30B-A3B/Qwen3-30B-A3B-BF16-00001-of-00002.gguf \
    -f calibration_data_v5_rc.txt \
    -o /mnt/raid/models/ubergarm/Qwen3-30B-A3B-GGUF/imatrix-Qwen3-30B-A3B.dat \
    --ctx-size 512 \
    -ngl 36 \
    --threads 16

======================== sorted layer importances
  0: Layer   0, <cos_sim> = 0.32154
  1: Layer  47, <cos_sim> = 0.38473
  2: Layer   1, <cos_sim> = 0.736987
  3: Layer  28, <cos_sim> = 0.845492
  4: Layer   2, <cos_sim> = 0.847391
  5: Layer  29, <cos_sim> = 0.859291
  6: Layer   7, <cos_sim> = 0.861405
  7: Layer   3, <cos_sim> = 0.878313
  8: Layer   8, <cos_sim> = 0.893971
  9: Layer   6, <cos_sim> = 0.900308
 10: Layer  42, <cos_sim> = 0.911525
 11: Layer   5, <cos_sim> = 0.912156
 12: Layer  17, <cos_sim> = 0.913169
 13: Layer   4, <cos_sim> = 0.914095
 14: Layer  13, <cos_sim> = 0.92175
 15: Layer  46, <cos_sim> = 0.925283
 16: Layer  19, <cos_sim> = 0.926845
 17: Layer  18, <cos_sim> = 0.927019
 18: Layer  45, <cos_sim> = 0.928896
 19: Layer  40, <cos_sim> = 0.934481
 20: Layer  31, <cos_sim> = 0.934585
 21: Layer  14, <cos_sim> = 0.936932
 22: Layer  16, <cos_sim> = 0.940338
 23: Layer  25, <cos_sim> = 0.940477
 24: Layer  10, <cos_sim> = 0.942312
 25: Layer  38, <cos_sim> = 0.943166
 26: Layer   9, <cos_sim> = 0.943843
 27: Layer  11, <cos_sim> = 0.944233
 28: Layer  37, <cos_sim> = 0.944325
 29: Layer  20, <cos_sim> = 0.94612
 30: Layer  22, <cos_sim> = 0.946449
 31: Layer  41, <cos_sim> = 0.946775
 32: Layer  39, <cos_sim> = 0.947228
 33: Layer  44, <cos_sim> = 0.947687
 34: Layer  30, <cos_sim> = 0.947942
 35: Layer  23, <cos_sim> = 0.949102
 36: Layer  12, <cos_sim> = 0.951618
 37: Layer  21, <cos_sim> = 0.951701
 38: Layer  24, <cos_sim> = 0.952261
 39: Layer  43, <cos_sim> = 0.953357
 40: Layer  27, <cos_sim> = 0.953528
 41: Layer  26, <cos_sim> = 0.95575
 42: Layer  32, <cos_sim> = 0.956024
 43: Layer  15, <cos_sim> = 0.956915
 44: Layer  35, <cos_sim> = 0.959861
 45: Layer  36, <cos_sim> = 0.960591
 46: Layer  34, <cos_sim> = 0.961539
 47: Layer  33, <cos_sim> = 0.968161

======================== sorted attention importances
  0: Layer   0, <cos_sim> = 0.353019
  1: Layer  45, <cos_sim> = 0.638476
  2: Layer   1, <cos_sim> = 0.674894
  3: Layer  29, <cos_sim> = 0.686547
  4: Layer  17, <cos_sim> = 0.708034
  5: Layer   3, <cos_sim> = 0.718456
  6: Layer  21, <cos_sim> = 0.72082
  7: Layer  44, <cos_sim> = 0.732611
  8: Layer  22, <cos_sim> = 0.738435
  9: Layer  18, <cos_sim> = 0.742531
 10: Layer  42, <cos_sim> = 0.745018
 11: Layer   8, <cos_sim> = 0.746792
 12: Layer  24, <cos_sim> = 0.750162
 13: Layer  23, <cos_sim> = 0.750384
 14: Layer   9, <cos_sim> = 0.754324
 15: Layer  46, <cos_sim> = 0.758528
 16: Layer  33, <cos_sim> = 0.76019
 17: Layer  47, <cos_sim> = 0.760449
 18: Layer  27, <cos_sim> = 0.760966
 19: Layer   4, <cos_sim> = 0.761774
 20: Layer   2, <cos_sim> = 0.762337
 21: Layer   6, <cos_sim> = 0.763453
 22: Layer  34, <cos_sim> = 0.765167
 23: Layer  30, <cos_sim> = 0.768629
 24: Layer  25, <cos_sim> = 0.768819
 25: Layer  26, <cos_sim> = 0.769841
 26: Layer  20, <cos_sim> = 0.77039
 27: Layer  10, <cos_sim> = 0.772251
 28: Layer  41, <cos_sim> = 0.773975
 29: Layer  35, <cos_sim> = 0.774599
 30: Layer  43, <cos_sim> = 0.775401
 31: Layer  11, <cos_sim> = 0.776914
 32: Layer  28, <cos_sim> = 0.778543
 33: Layer  19, <cos_sim> = 0.781975
 34: Layer  36, <cos_sim> = 0.78645
 35: Layer  32, <cos_sim> = 0.790626
 36: Layer  15, <cos_sim> = 0.795375
 37: Layer  12, <cos_sim> = 0.797279
 38: Layer  16, <cos_sim> = 0.797483
 39: Layer  14, <cos_sim> = 0.797921
 40: Layer   7, <cos_sim> = 0.80098
 41: Layer   5, <cos_sim> = 0.802361
 42: Layer  37, <cos_sim> = 0.805299
 43: Layer  13, <cos_sim> = 0.806054
 44: Layer  31, <cos_sim> = 0.807454
 45: Layer  38, <cos_sim> = 0.808983
 46: Layer  40, <cos_sim> = 0.813216
 47: Layer  39, <cos_sim> = 0.816557

======================== sorted ffn importances
  0: Layer  47, <cos_sim> = 0.613059
  1: Layer  44, <cos_sim> = 0.630819
  2: Layer   0, <cos_sim> = 0.653987
  3: Layer  28, <cos_sim> = 0.686159
  4: Layer  16, <cos_sim> = 0.693473
  5: Layer   7, <cos_sim> = 0.694612
  6: Layer  43, <cos_sim> = 0.710648
  7: Layer  20, <cos_sim> = 0.71511
  8: Layer  21, <cos_sim> = 0.715567
  9: Layer  46, <cos_sim> = 0.71785
 10: Layer  45, <cos_sim> = 0.718143
 11: Layer   1, <cos_sim> = 0.726385
 12: Layer   3, <cos_sim> = 0.735632
 13: Layer   8, <cos_sim> = 0.736597
 14: Layer   2, <cos_sim> = 0.737616
 15: Layer  22, <cos_sim> = 0.739272
 16: Layer  33, <cos_sim> = 0.739951
 17: Layer  19, <cos_sim> = 0.740003
 18: Layer   9, <cos_sim> = 0.742748
 19: Layer  32, <cos_sim> = 0.747542
 20: Layer  23, <cos_sim> = 0.749229
 21: Layer  24, <cos_sim> = 0.755807
 22: Layer  41, <cos_sim> = 0.75653
 23: Layer  10, <cos_sim> = 0.757337
 24: Layer  34, <cos_sim> = 0.758472
 25: Layer  31, <cos_sim> = 0.759585
 26: Layer  40, <cos_sim> = 0.763913
 27: Layer  17, <cos_sim> = 0.768032
 28: Layer  26, <cos_sim> = 0.768999
 29: Layer  18, <cos_sim> = 0.771782
 30: Layer   6, <cos_sim> = 0.776553
 31: Layer   4, <cos_sim> = 0.777394
 32: Layer  27, <cos_sim> = 0.777827
 33: Layer  35, <cos_sim> = 0.778635
 34: Layer  42, <cos_sim> = 0.779552
 35: Layer  36, <cos_sim> = 0.779963
 36: Layer  25, <cos_sim> = 0.785371
 37: Layer  12, <cos_sim> = 0.785794
 38: Layer  29, <cos_sim> = 0.787757
 39: Layer   5, <cos_sim> = 0.79259
 40: Layer  11, <cos_sim> = 0.793774
 41: Layer  15, <cos_sim> = 0.796992
 42: Layer  30, <cos_sim> = 0.797935
 43: Layer  14, <cos_sim> = 0.7999
 44: Layer  39, <cos_sim> = 0.806665
 45: Layer  38, <cos_sim> = 0.813561
 46: Layer  13, <cos_sim> = 0.820982
 47: Layer  37, <cos_sim> = 0.830343
👈 Benchmarking Methodology

Benchmark Suite

The benchmark client used is bartowski's patched evalchemy fork containing fixes for easier use across a variety of LLM server API endpoints.

Benchmark suite testing was done with llama.cpp@36667c8e on a subset of models.

For llama.cpp server:

cd llama.cpp
git checkout 36667c8e
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j $(nproc)

model=/mnt/raid/models/bartowski/Qwen_Qwen3-30B-A3B-GGUF/Qwen_Qwen3-30B-A3B-IQ2_M.gguf
name=bartowski/Qwen3-30B-A3B-IQ2_M

CUDA_VISIBLE_DEVICES="1" \
./build/bin/llama-server \
  --model "$model" \
  --alias "$name" \
  --api-key super-secret-change-me \
  -fa \
  -ctk f16 -ctv f16 \
  -c 262144 \
  --parallel 8 \
  --slots \
  -ngl 99 \
  --threads 1 \
  --host 127.0.0.1 \
  --port 8088

For ik_llama.cpp server:

cd ik_llama.cpp
git checkout e3fec173
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j $(nproc)

model=/mnt/raid/models/ubergarm/Qwen3-30B-A3B-GGUF/Qwen3-30B-A3B-mix-IQ4_K.gguf
name=ubergarm/Qwen3-30B-A3B-mix-IQ4_K

CUDA_VISIBLE_DEVICES="1" \
./build/bin/llama-server \
  --model "$model" \
  --alias "$name" \
  --api-key super-secret-change-me \
  -fmoe \
  -fa \
  -ctk f16 -ctv f16 \
  -c 262144 \
  --parallel 8 \
  -ngl 99 \
  --threads 1 \
  --host 127.0.0.1 \
  --port 8088

For vllm server:

CUDA_VISIBLE_DEVICES="1" \
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
VLLM_USE_MODELSCOPE=True \
vllm \
  serve swift/Qwen3-30B-A3B-AWQ \
  --served-model-name Qwen3-30B-A3B-AWQ \
  --gpu-memory-utilization 0.9 \
  --max-model-len 32768 \
  --max-num-seqs 64 \
  --api-key super-secret-change-me \
  --host 127.0.0.1 \
  --port 8080
👈 Speed Benchmark Methodology

Note: there was probably no warmup (I saw a PR on ik's fork about it), so the first data point trends low.

cd llama.cpp
git checkout ug/port-sweep-bench
# llama.cpp@814f795e + ug/port-sweep-bench
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j $(nproc)

#model=/mnt/astrodata/llm/models/bartowski/Qwen_Qwen3-30B-A3B-GGUF/Qwen_Qwen3-30B-A3B-Q4_K_M.gguf
#model=/mnt/astrodata/llm/models/bartowski/Qwen_Qwen3-30B-A3B-GGUF/Qwen_Qwen3-30B-A3B-Q2_K_L.gguf
#model=/mnt/astrodata/llm/models/bartowski/Qwen_Qwen3-30B-A3B-GGUF/Qwen_Qwen3-30B-A3B-IQ2_M.gguf

#model=/mnt/astrodata/llm/models/unsloth/Qwen3-30B-A3B-GGUF/Qwen3-30B-A3B-UD-Q2_K_XL.gguf
#model=/mnt/astrodata/llm/models/unsloth/Qwen3-30B-A3B-GGUF/Qwen3-30B-A3B-UD-IQ2_M.gguf
model=/mnt/astrodata/llm/models/unsloth/Qwen3-30B-A3B-GGUF/Qwen3-30B-A3B-UD-Q4_K_XL.gguf

CUDA_VISIBLE_DEVICES=0 \
./build/bin/llama-sweep-bench \
    --model "$model" \
    -fa \
    -ctk f16 -ctv f16 \
    -c 32768 \
    -ngl 99 \
    --threads 1

Raw Data

👈 Perplexity, KLD, and Δp Raw Data Table

Parsed this data from a bunch of logs generated above. It is not in the most beautiful order, so feel free to copy-paste it into Google Docs or wherever you'd like to make your own graphs.

Model Size 0.1% Δp 1.0% KLD 1.0% Δp 10.0% KLD 10.0% Δp 25.0% Δp 5.0% KLD 5.0% Δp 75.0% Δp 90.0% Δp 95.0% Δp 99.0% KLD 99.0% Δp 99.9% KLD 99.9% Δp Maximum KLD Maximum Δp Mean KLD Mean KLD uncertainty Mean Δp Mean Δp uncertainty Mean PPL(Q) ubergarm-kld-test-corpus.txt Mean PPL(Q) uncertainty ubergarm-kld-test-corpus.txt Median KLD Median Δp Minimum KLD Minimum Δp PPL uncertainty wiki.test.raw PPL wiki.test.raw RMS Δp RMS Δp uncertainty Same top p Same top p uncertainty
Qwen/Qwen3-235B-A22B-BF16 438
ubergarm/Qwen3-235B-A22B-Q8_0 233 11.7194 0.07212 0.03321 5.3141
ubergarm/Qwen3-235B-A22B-mix-IQ3_K 107 -18.276% 0.000036 -8.542% 0.000940 -2.631% -0.686% 0.000310 -4.272% 0.587% 2.504% 4.175% 0.098368 8.257% 0.296680 17.122% 2.906263 63.764% 0.014594 0.000064 -0.049 0.006 11.788282 0.072648 0.008979 -0.001% -0.000039 -72.329% 0.03421 5.4403 2.846 0.017 93.459 0.056
lmstudio-community/Qwen3-235B-A22B-Q3_K_L 104 -27.956% 0.000083 -14.266% 0.002466 -4.579% -1.294% 0.000766 -7.290% 0.786% 3.742% 6.267% 0.219563 12.470% 0.628216 24.126% 8.358958 77.349% 0.036266 0.000140 -0.284 0.010 11.904309 0.073302 0.023930 -0.010% -0.000003 -99.970% 0.03584 5.6582 4.496 0.025 89.756 0.069
unsloth/Qwen3-235B-A22B-UD-Q3_K_XL 97 -25.243% 0.000060 -12.180% 0.001945 -3.752% -0.962% 0.000612 -6.159% 0.874% 3.649% 5.976% 0.180988 11.713% 0.543533 22.421% 5.471307 64.130% 0.029122 0.000123 -0.059 0.009 11.855173 0.073300 0.018888 -0.000% -0.000004 -98.693% 0.03524 5.5695 4.018 0.023 90.694 0.066
Qwen/Qwen3-30B-A3B-BF16 56.9 15.1443 0.10239 0.07223 9.0703
ubergarm/Qwen3-30B-A3B-Q8_0 30.3 -7.050% 0.000001 -3.834% 0.000154 -1.241% -0.282% 0.000038 -2.035% 0.231% 1.176% 1.964% 0.013699 3.763% 0.039718 7.128% 0.359152 28.466% 0.002337 0.000009 -0.020 0.003 15.152095 0.102398 0.001587 -0.000% -0.000047 -34.379% 0.07228 9.0740 1.279 0.008 96.972 0.039
ubergarm/Qwen3-30B-A3B-mix-IQ4_K 17.7 -11.731% 0.000004 -5.522% 0.000298 -1.645% -0.376% 0.000080 -2.742% 0.326% 1.592% 2.682% 0.032109 5.373% 0.104454 10.626% 2.514502 39.508% 0.004821 0.000024 -0.025 0.004 15.218819 0.103071 0.002970 -0.000% -0.000048 -44.213% 0.07278 9.1184 1.818 0.011 95.945 0.045
bartowski/Qwen3-30B-A3B-Q4_K_M 17.4 -16.135% 0.000008 -8.303% 0.000652 -2.643% -0.645% 0.000171 -4.286% 0.398% 2.084% 3.570% 0.063238 7.356% 0.195169 14.392% 5.985787 61.522% 0.010136 0.000053 -0.158 0.006 15.194468 0.102605 0.006434 -0.001% -0.000032 -88.357% 0.07381 9.2092 2.619 0.018 94.329 0.053
bartowski/Qwen3-30B-A3B-Q4_K_S 16.8 -18.122% 0.000013 -9.230% 0.000862 -3.006% -0.780% 0.000235 -4.787% 0.402% 2.215% 3.866% 0.077885 7.972% 0.233980 15.420% 5.971601 66.795% 0.012915 0.000065 -0.227 0.007 15.202408 0.102513 0.008261 -0.002% -0.000038 -87.019% 0.07371 9.2232 2.885 0.019 93.804 0.055
unsloth/Qwen3-30B-A3B-UD-Q4_K_XL 16.5 -21.984% 0.000015 -11.111% 0.001152 -3.508% -0.938% 0.000315 -5.582% 0.421% 2.460% 4.261% 0.102021 8.910% 0.305740 17.384% 5.570370 67.990% 0.016495 0.000071 -0.320 0.008 15.281833 0.103140 0.010432 -0.005% -0.000016 -85.356% 0.07290 9.1688 3.333 0.020 93.169 0.058
ubergarm/Qwen3-30B-A3B-IQ4_KS 15.5 -20.721% 0.000018 -10.000% 0.001003 -3.073% -0.796% 0.000292 -5.017% 0.442% 2.398% 4.167% 0.094074 8.691% 0.282245 16.987% 6.828948 89.561% 0.014617 0.000068 -0.209 0.007 15.182811 0.102278 0.008934 -0.003% -0.000031 -75.475% 0.07061 8.9862 3.106 0.019 93.625 0.056
ikawrakow/Qwen3-30B-A3B-IQ4_KS-Bartowski 15.3 -20.846% 0.000021 -10.497% 0.001098 -3.434% -0.905% 0.000316 -5.433% 0.421% 2.427% 4.216% 0.099815 8.719% 0.290617 17.546% 6.971420 81.571% 0.015818 0.000074 -0.288 0.007 15.150462 0.101931 0.009988 -0.004% -0.000029 -86.592% 0.07078 9.0016 3.244 0.020 93.317 0.057
ikawrakow/Qwen3-30B-A3B-IQ4_KS-IK 15.3 -21.414% 0.000026 -10.689% 0.001192 -3.461% -0.959% 0.000352 -5.489% 0.405% 2.383% 4.163% 0.102473 8.750% 0.301946 17.416% 7.146766 58.365% 0.016277 0.000074 -0.323 0.007 15.161535 0.101972 0.010269 -0.006% -0.000007 -90.822% 0.07094 9.0177 3.265 0.019 93.216 0.057
ikawrakow/Qwen3-30B-A3B-IQ4_KS-Unslolth 15.3 -21.919% 0.000023 -11.082% 0.001218 -3.610% -1.015% 0.000351 -5.698% 0.396% 2.355% 4.173% 0.104796 8.799% 0.314624 18.042% 7.383745 78.742% 0.016845 0.000077 -0.366 0.008 15.109454 0.101327 0.010667 -0.006% -0.000012 -86.065% 0.06945 8.9171 3.331 0.020 93.217 0.057
unsloth/Qwen3-30B-A3B-UD-IQ2_M 10.1 -47.141% 0.000072 -22.803% 0.004283 -6.698% -1.739% 0.001229 -11.071% 0.843% 4.934% 8.514% 0.457244 17.671% 1.370219 34.262% 8.153114 88.509% 0.066646 0.000267 -0.607 0.015 15.889509 0.107834 0.039668 -0.011% -0.000011 -99.283% 0.08541 10.3726 6.627 0.033 87.029 0.077
bartowski/Qwen3-30B-A3B-IQ2_M 9.7 -48.093% 0.000068 -24.583% 0.005231 -8.541% -2.590% 0.001459 -13.210% 0.538% 4.031% 7.477% 0.432021 16.466% 1.262156 31.659% 8.695639 80.027% 0.069100 0.000258 -1.300 0.016 15.436905 0.102661 0.044448 -0.039% -0.000004 -96.452% 0.08036 9.9788 6.979 0.033 86.303 0.079
👈 Benchmark Suite Raw Data Table

TODO copy/paste it all somewhere if there is enough interest.

👈 llama-sweep-bench Speed Data

bartowski/Q4_K_M

PP TG N_KV T_PP s S_PP t/s T_TG s S_TG t/s
512 128 0 0.186 2746.40 0.912 140.37
512 128 512 0.189 2709.05 0.941 135.99
512 128 1024 0.190 2689.73 0.940 136.22
512 128 1536 0.195 2631.96 0.943 135.78
512 128 2048 0.197 2601.24 0.957 133.69
512 128 2560 0.201 2553.51 0.959 133.43
512 128 3072 0.203 2526.21 0.966 132.56
512 128 3584 0.207 2472.32 0.976 131.16
512 128 4096 0.210 2432.41 0.986 129.80
512 128 4608 0.213 2406.39 0.996 128.50
512 128 5120 0.215 2385.53 1.008 126.99
512 128 5632 0.218 2347.09 1.018 125.72
512 128 6144 0.221 2321.62 1.029 124.44
512 128 6656 0.224 2287.95 1.041 123.02
512 128 7168 0.227 2252.04 1.053 121.57
512 128 7680 0.231 2218.25 1.065 120.17
512 128 8192 0.233 2194.17 1.075 119.04
512 128 8704 0.235 2175.86 1.086 117.92
512 128 9216 0.240 2133.00 1.099 116.47
512 128 9728 0.241 2126.89 1.109 115.46
512 128 10240 0.245 2089.25 1.120 114.25
512 128 10752 0.249 2055.28 1.164 109.96
512 128 11264 0.252 2032.46 1.181 108.43
512 128 11776 0.254 2011.96 1.171 109.29
512 128 12288 0.257 1993.13 1.175 108.95
512 128 12800 0.260 1970.94 1.184 108.08
512 128 13312 0.264 1939.95 1.186 107.95
512 128 13824 0.265 1930.30 1.194 107.24
512 128 14336 0.270 1897.48 1.197 106.89
512 128 14848 0.272 1880.96 1.204 106.32
512 128 15360 0.276 1856.05 1.214 105.45
512 128 15872 0.279 1832.42 1.221 104.82
512 128 16384 0.283 1809.73 1.229 104.13
512 128 16896 0.285 1796.89 1.234 103.69
512 128 17408 0.288 1778.96 1.242 103.08
512 128 17920 0.293 1746.74 1.249 102.52
512 128 18432 0.296 1729.58 1.256 101.89
512 128 18944 0.298 1715.59 1.264 101.23
512 128 19456 0.302 1697.53 1.269 100.87
512 128 19968 0.304 1684.14 1.278 100.13
512 128 20480 0.307 1665.46 1.284 99.71
512 128 20992 0.311 1644.88 1.291 99.12
512 128 21504 0.314 1631.38 1.334 95.97
512 128 22016 0.317 1613.83 1.347 95.01
512 128 22528 0.321 1596.46 1.339 95.57
512 128 23040 0.322 1589.42 1.345 95.16
512 128 23552 0.325 1573.55 1.352 94.64
512 128 24064 0.329 1556.41 1.358 94.25
512 128 24576 0.333 1537.96 1.363 93.93
512 128 25088 0.335 1529.21 1.369 93.52
512 128 25600 0.340 1506.80 1.378 92.91
512 128 26112 0.343 1494.38 1.383 92.54
512 128 26624 0.347 1476.69 1.392 91.98
512 128 27136 0.350 1464.63 1.398 91.53
512 128 27648 0.353 1451.77 1.405 91.13
512 128 28160 0.355 1442.42 1.411 90.69
512 128 28672 0.359 1427.94 1.418 90.26
512 128 29184 0.362 1415.01 1.426 89.77
512 128 29696 0.364 1406.75 1.433 89.33
512 128 30208 0.367 1393.57 1.441 88.84
512 128 30720 0.371 1379.72 1.450 88.27
512 128 31232 0.374 1367.29 1.456 87.93
512 128 31744 0.378 1355.16 1.464 87.43
512 128 32256 0.381 1343.89 1.507 84.94

bartowski/Q2_K_L

PP TG N_KV T_PP s S_PP t/s T_TG s S_TG t/s
512 128 0 0.219 2342.04 0.940 136.14
512 128 512 0.221 2320.24 0.968 132.17
512 128 1024 0.222 2302.08 0.968 132.25
512 128 1536 0.228 2245.09 0.976 131.11
512 128 2048 0.230 2230.09 0.990 129.34
512 128 2560 0.233 2201.35 0.998 128.21
512 128 3072 0.236 2168.36 1.005 127.38
512 128 3584 0.240 2128.94 1.014 126.18
512 128 4096 0.243 2102.88 1.025 124.82
512 128 4608 0.245 2093.47 1.035 123.68
512 128 5120 0.248 2062.11 1.045 122.44
512 128 5632 0.251 2042.84 1.057 121.12
512 128 6144 0.254 2016.60 1.069 119.78
512 128 6656 0.256 1996.33 1.081 118.46
512 128 7168 0.260 1965.62 1.090 117.42
512 128 7680 0.264 1939.11 1.103 116.03
512 128 8192 0.267 1917.69 1.114 114.86
512 128 8704 0.269 1902.68 1.123 113.97
512 128 9216 0.275 1864.88 1.139 112.41
512 128 9728 0.275 1864.80 1.149 111.43
512 128 10240 0.280 1831.10 1.173 109.12
512 128 10752 0.282 1813.40 1.209 105.90
512 128 11264 0.286 1792.80 1.224 104.61
512 128 11776 0.289 1769.64 1.217 105.19
512 128 12288 0.291 1756.56 1.219 104.97
512 128 12800 0.296 1730.89 1.230 104.08
512 128 13312 0.298 1717.56 1.231 103.94
512 128 13824 0.299 1709.78 1.237 103.48
512 128 14336 0.304 1684.98 1.241 103.15
512 128 14848 0.306 1672.32 1.247 102.63
512 128 15360 0.309 1657.69 1.251 102.28
512 128 15872 0.312 1642.84 1.258 101.72
512 128 16384 0.316 1620.66 1.265 101.16
512 128 16896 0.319 1603.11 1.271 100.68
512 128 17408 0.322 1592.25 1.280 100.04
512 128 17920 0.325 1573.98 1.286 99.52
512 128 18432 0.328 1560.54 1.295 98.82
512 128 18944 0.331 1547.27 1.303 98.27
512 128 19456 0.336 1525.32 1.308 97.87
512 128 19968 0.336 1523.96 1.317 97.16
512 128 20480 0.339 1509.92 1.323 96.72
512 128 20992 0.342 1498.56 1.328 96.36
512 128 21504 0.344 1487.29 1.368 93.54
512 128 22016 0.348 1469.52 1.386 92.32
512 128 22528 0.351 1458.22 1.377 92.95
512 128 23040 0.354 1447.65 1.383 92.56
512 128 23552 0.357 1434.13 1.392 91.95
512 128 24064 0.361 1417.81 1.397 91.60
512 128 24576 0.365 1401.75 1.400 91.40
512 128 25088 0.367 1395.82 1.408 90.89
512 128 25600 0.369 1387.75 1.412 90.67
512 128 26112 0.374 1368.77 1.418 90.29
512 128 26624 0.377 1359.02 1.427 89.71
512 128 27136 0.380 1347.28 1.434 89.25
512 128 27648 0.383 1336.61 1.439 88.92
512 128 28160 0.387 1322.05 1.446 88.50
512 128 28672 0.389 1315.73 1.454 88.02
512 128 29184 0.392 1307.57 1.461 87.58
512 128 29696 0.395 1295.59 1.468 87.16
512 128 30208 0.400 1281.33 1.475 86.77
512 128 30720 0.403 1269.72 1.485 86.17
512 128 31232 0.406 1260.77 1.493 85.75
512 128 31744 0.411 1245.97 1.499 85.37
512 128 32256 0.411 1244.60 1.538 83.20

bartowski/IQ2_M

PP TG N_KV T_PP s S_PP t/s T_TG s S_TG t/s
512 128 0 0.199 2571.39 0.929 137.72
512 128 512 0.200 2558.87 0.958 133.66
512 128 1024 0.205 2502.88 0.958 133.60
512 128 1536 0.209 2449.39 0.966 132.45
512 128 2048 0.211 2424.91 0.979 130.70
512 128 2560 0.214 2387.42 0.981 130.51
512 128 3072 0.217 2359.21 0.990 129.36
512 128 3584 0.220 2322.95 1.001 127.93
512 128 4096 0.224 2281.51 1.011 126.63
512 128 4608 0.226 2264.66 1.020 125.44
512 128 5120 0.228 2246.85 1.031 124.21
512 128 5632 0.231 2218.24 1.040 123.07
512 128 6144 0.235 2177.99 1.054 121.47
512 128 6656 0.237 2158.85 1.065 120.14
512 128 7168 0.241 2124.91 1.078 118.72
512 128 7680 0.245 2088.47 1.094 116.98
512 128 8192 0.248 2066.12 1.106 115.68
512 128 8704 0.250 2044.39 1.117 114.54
512 128 9216 0.253 2023.04 1.130 113.27
512 128 9728 0.256 2002.81 1.141 112.18
512 128 10240 0.259 1980.01 1.154 110.94
512 128 10752 0.263 1945.18 1.198 106.84
512 128 11264 0.265 1928.54 1.211 105.70
512 128 11776 0.268 1908.01 1.204 106.28
512 128 12288 0.271 1891.82 1.207 106.08
512 128 12800 0.275 1861.92 1.216 105.27
512 128 13312 0.277 1846.15 1.219 104.99
512 128 13824 0.280 1829.45 1.226 104.43
512 128 14336 0.283 1807.34 1.229 104.17
512 128 14848 0.286 1789.55 1.233 103.77
512 128 15360 0.289 1774.14 1.241 103.12
512 128 15872 0.293 1750.23 1.248 102.55
512 128 16384 0.296 1730.68 1.256 101.88
512 128 16896 0.299 1713.86 1.261 101.49
512 128 17408 0.301 1700.49 1.271 100.72
512 128 17920 0.306 1671.47 1.281 99.93
512 128 18432 0.310 1652.08 1.291 99.17
512 128 18944 0.313 1637.83 1.299 98.53
512 128 19456 0.316 1618.98 1.302 98.32
512 128 19968 0.317 1612.79 1.314 97.42
512 128 20480 0.321 1595.76 1.319 97.04
512 128 20992 0.326 1572.01 1.327 96.43
512 128 21504 0.328 1561.24 1.369 93.51
512 128 22016 0.332 1543.74 1.383 92.57
512 128 22528 0.335 1529.05 1.373 93.23
512 128 23040 0.336 1524.73 1.374 93.17
512 128 23552 0.337 1517.70 1.386 92.33
512 128 24064 0.343 1493.95 1.387 92.27
512 128 24576 0.346 1481.52 1.393 91.88
512 128 25088 0.349 1466.47 1.401 91.37
512 128 25600 0.350 1462.59 1.406 91.06
512 128 26112 0.356 1438.68 1.413 90.61
512 128 26624 0.359 1425.06 1.418 90.29
512 128 27136 0.361 1417.08 1.426 89.75
512 128 27648 0.365 1403.93 1.433 89.33
512 128 28160 0.368 1389.95 1.442 88.74
512 128 28672 0.371 1380.36 1.454 88.02
512 128 29184 0.374 1369.27 1.458 87.79
512 128 29696 0.378 1355.92 1.465 87.36
512 128 30208 0.381 1345.24 1.471 87.01
512 128 30720 0.383 1336.71 1.482 86.39
512 128 31232 0.387 1324.60 1.486 86.11
512 128 31744 0.390 1311.28 1.494 85.65
512 128 32256 0.393 1302.29 1.535 83.40

unsloth/UD-Q4_K_XL

PP TG N_KV T_PP s S_PP t/s T_TG s S_TG t/s
512 128 0 0.185 2771.13 0.895 143.07
512 128 512 0.187 2735.63 0.923 138.71
512 128 1024 0.190 2699.01 0.921 138.95
512 128 1536 0.195 2627.30 0.930 137.64
512 128 2048 0.196 2614.49 0.943 135.73
512 128 2560 0.200 2560.59 0.947 135.10
512 128 3072 0.202 2528.42 0.954 134.19
512 128 3584 0.206 2481.69 0.964 132.77
512 128 4096 0.210 2443.23 0.974 131.47
512 128 4608 0.212 2413.67 0.985 129.96
512 128 5120 0.214 2394.67 0.995 128.61
512 128 5632 0.219 2340.45 1.015 126.14
512 128 6144 0.222 2306.96 1.024 125.01
512 128 6656 0.225 2273.36 1.035 123.64
512 128 7168 0.228 2242.54 1.050 121.92
512 128 7680 0.231 2212.63 1.060 120.71
512 128 8192 0.235 2182.09 1.068 119.82
512 128 8704 0.237 2157.82 1.082 118.25
512 128 9216 0.241 2123.14 1.097 116.72
512 128 9728 0.243 2109.32 1.104 115.90
512 128 10240 0.246 2077.16 1.119 114.35
512 128 10752 0.250 2049.47 1.168 109.62
512 128 11264 0.254 2017.75 1.183 108.21
512 128 11776 0.255 2009.66 1.173 109.13
512 128 12288 0.259 1976.27 1.176 108.86
512 128 12800 0.261 1957.95 1.186 107.97
512 128 13312 0.266 1926.83 1.187 107.84
512 128 13824 0.267 1914.87 1.191 107.45
512 128 14336 0.271 1888.06 1.196 107.00
512 128 14848 0.274 1869.73 1.202 106.49
512 128 15360 0.277 1849.09 1.209 105.84
512 128 15872 0.280 1828.40 1.215 105.35
512 128 16384 0.284 1801.44 1.224 104.57
512 128 16896 0.287 1781.87 1.229 104.13
512 128 17408 0.290 1767.18 1.239 103.35
512 128 17920 0.293 1747.06 1.245 102.83
512 128 18432 0.296 1731.39 1.252 102.25
512 128 18944 0.299 1712.43 1.259 101.64
512 128 19456 0.303 1690.65 1.265 101.17
512 128 19968 0.304 1682.41 1.276 100.31
512 128 20480 0.308 1660.25 1.280 99.99
512 128 20992 0.312 1641.94 1.285 99.57
512 128 21504 0.314 1628.35 1.331 96.17
512 128 22016 0.318 1611.79 1.346 95.11
512 128 22528 0.321 1596.28 1.337 95.72
512 128 23040 0.324 1580.92 1.340 95.54
512 128 23552 0.325 1573.30 1.351 94.74
512 128 24064 0.330 1552.94 1.350 94.81
512 128 24576 0.334 1534.84 1.355 94.48
512 128 25088 0.335 1526.93 1.361 94.06
512 128 25600 0.339 1511.89 1.366 93.70
512 128 26112 0.343 1492.70 1.383 92.55
512 128 26624 0.347 1476.86 1.387 92.27
512 128 27136 0.350 1462.35 1.397 91.63
512 128 27648 0.354 1446.91 1.404 91.16
512 128 28160 0.356 1438.02 1.412 90.66
512 128 28672 0.361 1419.66 1.418 90.26
512 128 29184 0.362 1413.92 1.426 89.77
512 128 29696 0.365 1401.20 1.433 89.32
512 128 30208 0.368 1391.23 1.439 88.97
512 128 30720 0.372 1377.54 1.450 88.29
512 128 31232 0.374 1369.93 1.453 88.09
512 128 31744 0.378 1356.09 1.462 87.56
512 128 32256 0.380 1347.04 1.503 85.14

unsloth/UD-Q2_K_XL

PP TG N_KV T_PP s S_PP t/s T_TG s S_TG t/s
512 128 0 0.211 2423.89 0.943 135.74
512 128 512 0.213 2399.31 0.971 131.85
512 128 1024 0.216 2374.33 0.969 132.11
512 128 1536 0.219 2340.30 0.979 130.80
512 128 2048 0.220 2325.55 0.991 129.11
512 128 2560 0.225 2276.44 0.994 128.82
512 128 3072 0.228 2247.84 1.001 127.89
512 128 3584 0.232 2207.44 1.011 126.60
512 128 4096 0.236 2170.20 1.023 125.06
512 128 4608 0.236 2166.89 1.032 124.00
512 128 5120 0.240 2131.06 1.044 122.58
512 128 5632 0.244 2102.12 1.054 121.41
512 128 6144 0.247 2076.33 1.063 120.43
512 128 6656 0.249 2055.14 1.077 118.82
512 128 7168 0.253 2024.47 1.088 117.68
512 128 7680 0.256 1996.90 1.099 116.45
512 128 8192 0.260 1967.17 1.114 114.93
512 128 8704 0.260 1967.20 1.122 114.06
512 128 9216 0.266 1922.64 1.135 112.81
512 128 9728 0.268 1911.09 1.147 111.63
512 128 10240 0.272 1885.44 1.157 110.64
512 128 10752 0.274 1865.36 1.202 106.45
512 128 11264 0.278 1844.60 1.217 105.18
512 128 11776 0.279 1836.43 1.208 105.93
512 128 12288 0.283 1810.13 1.213 105.57
512 128 12800 0.288 1780.11 1.229 104.16
512 128 13312 0.291 1758.14 1.229 104.12
512 128 13824 0.292 1753.98 1.238 103.39
512 128 14336 0.298 1718.12 1.241 103.10
512 128 14848 0.300 1706.26 1.247 102.61
512 128 15360 0.302 1693.28 1.254 102.07
512 128 15872 0.306 1673.01 1.262 101.46
512 128 16384 0.310 1650.90 1.268 100.96
512 128 16896 0.313 1638.03 1.275 100.41
512 128 17408 0.315 1625.29 1.281 99.90
512 128 17920 0.318 1609.23 1.289 99.31
512 128 18432 0.322 1589.10 1.297 98.68
512 128 18944 0.325 1575.42 1.302 98.29
512 128 19456 0.330 1553.28 1.310 97.73
512 128 19968 0.330 1552.98 1.319 97.05
512 128 20480 0.334 1531.58 1.324 96.67
512 128 20992 0.337 1518.07 1.332 96.12
512 128 21504 0.340 1507.15 1.373 93.25
512 128 22016 0.344 1488.06 1.385 92.41
512 128 22528 0.347 1477.13 1.378 92.88
512 128 23040 0.349 1467.54 1.384 92.47
512 128 23552 0.351 1459.50 1.394 91.80
512 128 24064 0.356 1440.13 1.397 91.61
512 128 24576 0.359 1426.95 1.401 91.36
512 128 25088 0.360 1423.59 1.409 90.82
512 128 25600 0.364 1405.52 1.413 90.62
512 128 26112 0.369 1388.93 1.419 90.18
512 128 26624 0.371 1379.47 1.426 89.79
512 128 27136 0.374 1369.38 1.434 89.28
512 128 27648 0.377 1357.58 1.441 88.85
512 128 28160 0.382 1342.07 1.447 88.44
512 128 28672 0.384 1333.90 1.455 87.99
512 128 29184 0.386 1326.66 1.461 87.62
512 128 29696 0.390 1313.92 1.468 87.22
512 128 30208 0.394 1298.28 1.483 86.34
512 128 30720 0.398 1286.81 1.488 86.02
512 128 31232 0.400 1280.36 1.494 85.70
512 128 31744 0.405 1263.20 1.502 85.21
512 128 32256 0.407 1257.02 1.545 82.83

unsloth/UD-IQ2_M

PP TG N_KV T_PP s S_PP t/s T_TG s S_TG t/s
512 128 0 0.198 2588.92 0.982 130.29
512 128 512 0.199 2574.42 1.008 127.04
512 128 1024 0.203 2527.70 1.007 127.07
512 128 1536 0.206 2488.20 1.017 125.90
512 128 2048 0.207 2468.48 1.031 124.15
512 128 2560 0.211 2427.66 1.037 123.42
512 128 3072 0.215 2376.22 1.045 122.45
512 128 3584 0.218 2344.71 1.055 121.32
512 128 4096 0.220 2323.83 1.066 120.13
512 128 4608 0.224 2286.56 1.075 119.08
512 128 5120 0.226 2263.56 1.086 117.87
512 128 5632 0.229 2233.20 1.097 116.64
512 128 6144 0.231 2216.06 1.108 115.56
512 128 6656 0.235 2174.18 1.125 113.82
512 128 7168 0.239 2141.53 1.137 112.61
512 128 7680 0.244 2099.48 1.148 111.48
512 128 8192 0.245 2087.77 1.160 110.38
512 128 8704 0.247 2076.19 1.170 109.38
512 128 9216 0.251 2040.21 1.183 108.22
512 128 9728 0.252 2028.41 1.192 107.39
512 128 10240 0.255 2006.18 1.204 106.35
512 128 10752 0.257 1988.99 1.247 102.62
512 128 11264 0.261 1963.06 1.264 101.28
512 128 11776 0.262 1951.61 1.257 101.84
512 128 12288 0.265 1932.03 1.260 101.62
512 128 12800 0.269 1901.02 1.269 100.83
512 128 13312 0.272 1882.44 1.271 100.72
512 128 13824 0.273 1873.24 1.274 100.48
512 128 14336 0.277 1845.12 1.281 99.91
512 128 14848 0.280 1830.87 1.290 99.24
512 128 15360 0.282 1812.46 1.296 98.79
512 128 15872 0.286 1793.02 1.302 98.31
512 128 16384 0.288 1778.72 1.309 97.80
512 128 16896 0.293 1745.22 1.316 97.26
512 128 17408 0.295 1732.67 1.323 96.76
512 128 17920 0.299 1714.14 1.331 96.19
512 128 18432 0.301 1698.58 1.337 95.74
512 128 18944 0.306 1675.72 1.350 94.84
512 128 19456 0.307 1668.01 1.349 94.88
512 128 19968 0.313 1636.65 1.360 94.11
512 128 20480 0.314 1632.97 1.366 93.72
512 128 20992 0.316 1620.05 1.374 93.17
512 128 21504 0.319 1606.86 1.411 90.70
512 128 22016 0.322 1590.15 1.426 89.75
512 128 22528 0.327 1567.20 1.422 90.01
512 128 23040 0.330 1553.12 1.425 89.83
512 128 23552 0.333 1536.30 1.434 89.28
512 128 24064 0.337 1520.89 1.434 89.24
512 128 24576 0.339 1508.19 1.440 88.87
512 128 25088 0.343 1492.82 1.446 88.52
512 128 25600 0.344 1487.87 1.451 88.21
512 128 26112 0.350 1461.28 1.459 87.74
512 128 26624 0.350 1463.85 1.466 87.32
512 128 27136 0.354 1445.83 1.474 86.86
512 128 27648 0.357 1432.50 1.485 86.20
512 128 28160 0.363 1410.51 1.487 86.10
512 128 28672 0.365 1402.82 1.493 85.72
512 128 29184 0.368 1389.55 1.502 85.22
512 128 29696 0.371 1379.92 1.508 84.87
512 128 30208 0.374 1367.99 1.514 84.55
512 128 30720 0.377 1359.40 1.524 84.00
512 128 31232 0.378 1353.18 1.529 83.72
512 128 31744 0.382 1338.84 1.538 83.22
512 128 32256 0.386 1327.16 1.578 81.10

Appendix and Definitions

👈 PPL, KLD, Δp Statistics

These statistics attempt to systematically measure the difference between an unquantized model and a given quantized version. In general lower is better, as it signals the quantized version performs more similarly to the original.

Quantization is the process of compressing an original model's weights to shrink it down to run on limited hardware. Ideally the process minimizes errors and preserves the original uncompressed model's performance.

Perplexity (PPL)

Perplexity (PPL) is a metric used to evaluate how well a language model predicts text. It essentially measures how "surprised" the model is by a given text—if the model is good at predicting the next word, the perplexity is low. For example, a model that generates coherent, contextually accurate text will have lower perplexity than one that produces random or nonsensical output.
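
Formally, for a tokenized test text $x_1, \ldots, x_N$ it is the exponentiated average negative log-likelihood the model assigns to each next token:

$$\mathrm{PPL} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N} \log p(x_i \mid x_{<i})\right)$$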

In the context of LLM quantization (e.g., reducing model precision to save resources), perplexity is used to check if the compressed model retains its language understanding. Generally the PPL of the unquantized model is expected to be lower than the PPL of a quantized version.

However, in quantization-aware training (QAT), the model is trained to handle lower-precision weights (e.g., from bf16 to int4) during training, simulating the effects of quantization. This helps the model adapt to the reduced precision, potentially maintaining performance even after quantization.

The PPL of the unquantized bf16 model might not always be lower, because a quantized model trained with QAT can retain performance very close to the original bf16 model. If QAT is effective, the quantized model's PPL could be similar to or even lower than the original's, meaning the unquantized model's PPL isn't necessarily the lowest.

Kullback-Leibler Divergence (KLD)

KL-Divergence (KLD) is a statistical measure that quantifies how different two probability distributions are. In the context of Large Language Models (LLMs) and quantization, it’s used to compare how a compressed (quantized) model differs from the original (unquantized) model in terms of their output probabilities.

If two models produce nearly identical predictions (e.g., the same probabilities for words in a sentence), their KLD is low. If their predictions diverge significantly (e.g., the quantized model chooses different words more often), the KLD is high.
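
Concretely, for each token position the baseline model's full-vocabulary distribution $P$ is compared against the quantized model's distribution $Q$:

$$D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{i \in \text{vocab}} P(i)\,\log\frac{P(i)}{Q(i)}$$

and the tables report summary statistics (mean, median, percentiles) of this value over all evaluated token positions.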

Typically a very large KLD baseline data file is generated on the original (or least quantized) version of the model. This baseline is then compared against quantized versions to measure KLD as well as Δp.
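
For example, the baseline .dat file used in the Methodology section above is produced by first running llama-perplexity on the base model with only --kl-divergence-base (to save its logits), then re-running against each quant with --kl-divergence added. A sketch with placeholder paths:

# 1) save the baseline logits from the unquantized (or least quantized) model
./build/bin/llama-perplexity \
    -m Qwen3-30B-A3B-BF16.gguf \
    -f ubergarm-kld-test-corpus.txt \
    --kl-divergence-base Qwen3-30B-A3B-BF16-kld-base.dat

# 2) compare a quantized model against that baseline
./build/bin/llama-perplexity \
    -m Qwen3-30B-A3B-mix-IQ4_K.gguf \
    -f ubergarm-kld-test-corpus.txt \
    --kl-divergence-base Qwen3-30B-A3B-BF16-kld-base.dat \
    --kl-divergence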

Δp (Delta p)

Δp (the token probability distribution difference) refers to the difference in token probability distributions between an unquantized (full-precision) model and a quantized model. It measures how much the probabilities assigned to individual tokens (e.g., words or subwords) change after quantization.

For example, for each token in a given input sequence, the unquantized model computes a probability distribution over the vocabulary (e.g., "the" has 10% chance, "cat" has 5%, etc.). The quantized model (e.g., IQ4_K or Q2_K_L) computes a similar distribution, but due to precision loss, the probabilities may shift. Δp is the absolute or relative difference between these two distributions for each token.
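
One common way to write the per-token quantity (an assumption about the exact convention used in these tables) is the signed difference

$$\Delta p(t) = p_{\mathrm{quant}}(t) - p_{\mathrm{base}}(t)$$

reported in percent, so a negative mean Δp means the quantized model tends to assign a bit less probability to the same tokens than the baseline did; the example below quotes the absolute value.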

A specific example: if the unquantized model assigns 0.2 to "cat" and the quantized model assigns 0.15, then the Δp for "cat" is 0.05.

👈 Benchmark Suites

Benchmarking Suite

GPQA Diamond

GPQA Diamond Set: A subset of 198 high-objectivity, challenging multiple-choice questions designed for advanced testing. Difficulty aligns with college-level or higher expertise in biology, physics, and chemistry. Intended for evaluating AI systems' ability to handle complex, domain-specific tasks requiring deep knowledge and critical thinking.

MBPP

MBPP (Mostly Basic Programming Problems) is a benchmark dataset designed to evaluate large language models (LLMs) on programming tasks focused on Python code. The benchmark consists of around 1,000 crowd-sourced Python programming problems, designed to be solvable by entry-level programmers, covering programming fundamentals, standard library functionality, and so on. Each problem consists of a task description, a code solution, and 3 automated test cases.

MMLU-Pro

MMLU-Pro is an enhanced benchmark designed to evaluate language understanding models across broader and more challenging tasks. Building on the Massive Multitask Language Understanding (MMLU) dataset, MMLU-Pro integrates more challenging, reasoning-focused questions and increases the answer choices per question from four to ten, significantly raising the difficulty and reducing the chance of success through random guessing. MMLU-Pro comprises over 12,000 rigorously curated questions from academic exams and textbooks, spanning 14 diverse domains including Biology, Business, Chemistry, Computer Science, Economics, Engineering, Health, History, Law, Math, Philosophy, Physics, Psychology, and Others.

MT-Bench

MT-Bench is a benchmark designed to evaluate the multi-turn conversational abilities and instruction-following skills of large language models (LLMs). Unlike traditional benchmarks that focus on closed-ended tasks (e.g., multiple-choice questions), MT-Bench emphasizes open-ended, real-world interactions to measure how well models handle complex, dynamic dialogues. Its authors conducted a detailed analysis of real multi-turn dialogue data to construct a three-tier hierarchical ability taxonomy comprising 4,208 turns across 1,388 multi-turn dialogues in 13 distinct tasks.

MixEval

MixEval is a ground-truth-based dynamic benchmark derived from off-the-shelf benchmark mixtures. It evaluates LLMs with a highly capable model ranking (i.e., 0.96 correlation with Chatbot Arena) while running locally and quickly (6% of the time and cost of running MMLU), and its queries are stably and effortlessly updated every month to avoid contamination.

@leonbeckert

Thank you for the analysis! Would love to see the full raw data :)

@leonbeckert

Also: I can't seem to find "ubergarm/Qwen3-30B-A3B-IQ4_KS" on Huggingface, could you provide a link?

@ubergarm (Author)

@leonbeckert

> Thank you for the analysis! Would love to see the full raw data :)

For which, the benchmarking itself? I'd have to hit up @bartowski1182 for that, but otherwise I've provided the PPL/KLD/sweep-bench data as best as I could.

> Also: I can't seem to find "ubergarm/Qwen3-30B-A3B-IQ4_KS" on Huggingface, could you provide a link?

Correct, I never released that; it was mainly for my testing. I've started tagging ik-specific quants on Hugging Face with ik_llama.cpp, though there are some others without that tag. Here are a few things that are close to the one in question:

  1. https://huggingface.co/ubergarm/Qwen3-30B-A3B-GGUF (if you really want my specific KS, i could probably upload it to this repo eventually)
  2. https://huggingface.co/ikawrakow/Qwen3-30B-A3B <-- has basically the same KS quant with 3 different imatrix by ik himself

@leonbeckert

@ubergarm
thank you for the response!

Yes, the raw benchmark data would be interesting. The other data is already very insightful, but I was hoping to gain some additional insights by interpreting it with some custom-written scripts.

Yes, I wanted to try your version out initially, but I found the ikawrakow KS quants to perform really well.
