pthurmond/12 Small LLMs Benchmark Template.md

Created June 11, 2026 15:30

Local LLM Benchmark Templates

12 Small LLMs Benchmark Template.md

🧠 12 Small LLMs Benchmarked on 15 Reasoning Questions — Template

Use this to replicate the test found in "12 Small LLMs Benchmarked on 15 Reasoning Questions (16384 ctx).md".

Test: [e.g., 5 Logic + 5 Coding + 5 Math questions]
Context: [e.g., 16384]
Setup: [e.g., all models tested locally with identical prompts]
Date: [YYYY-MM-DD]
Hardware: [GPU/CPU/RAM]

How to use this section: Fill in the run metadata before testing so future comparisons are apples-to-apples.

🏆 Full Rankings (15 questions)

How to use this section: Add one row per model. Use X rank for failed/incomplete runs, and explain in Notes.

Rank	Model	Params	Score	Logic (5)	Code (5)	Math (5)	Speed	Notes
[1]	[Model Name]	[e.g., 4B]	[__/15]	[__/5]	[__/5]	[__/5]	[very fast/fast/normal/slow]	[optional]
[2]	[Model Name]	[e.g., 8B]	[__/15]	[__/5]	[__/5]	[__/5]	[speed]	[optional]
[X]	[Model Name]	[e.g., 12B]	[Crashed/Incomplete]	[-]	[-]	[-]	[failed/very slow]	[what happened]

🔥 Key Findings

How to use this section: Summarize the top 3–5 takeaways from this run only. Keep each finding concise and evidence-based.

1. [Finding title]

[Observation]
[Supporting comparison]

2. [Finding title]

[Observation]
[Supporting comparison]

3. [Finding title]

[Observation]
[Supporting comparison]

4. Parameter efficiency (optional)

How to use this subsection: Include only if active parameter data is known.

Model	Active Params	Score	Score/B
[Model A]	[e.g., 3B]	[__]	[__]
[Model B]	[e.g., 4B]	[__]	[__]

5. Hardest questions

How to use this subsection: List the questions most frequently missed and how many models failed each.

[Question ID + short label]: [x/y] models failed
[Question ID + short label]: [x/y] models failed
[Question ID + short label]: [x/y] models failed
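
The failure counts for this subsection can be tallied from per-question results rather than by hand. The following is a minimal sketch, assuming you record each model's outcome as a mapping from question ID to pass/fail; the `results` structure, the model names, and the `hardest_questions` helper are all hypothetical and not part of the original benchmark.

```python
from collections import Counter

# Hypothetical per-model results: question ID -> True if answered correctly.
# Model names and outcomes are placeholders, not measured data.
results = {
    "model-a": {"S1": True, "S2": False, "S3": False},
    "model-b": {"S1": False, "S2": True, "S3": False},
    "model-c": {"S1": True, "S2": True, "S3": False},
}


def hardest_questions(results, top_n=3):
    """Return (question_id, failures, total_models), most-missed first."""
    failures = Counter()
    for per_question in results.values():
        for qid, correct in per_question.items():
            if not correct:
                failures[qid] += 1
    total = len(results)
    # Sort by failure count (descending), then by question ID for stable ties.
    ranked = sorted(failures.items(), key=lambda item: (-item[1], item[0]))
    return [(qid, count, total) for qid, count in ranked[:top_n]]


for qid, failed, total in hardest_questions(results):
    print(f"{qid}: {failed}/{total} models failed")
```

Models that crashed or did not finish should probably be left out of `results` so the denominator matches the number of completed runs.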

⚡ Speed Notes

How to use this section: Group models by practical inference speed at this context size. Keep labels consistent across reports.

Very fast: [model names]
Fast: [model names]
Normal: [model names]
Slow: [model names]
Too slow to test: [model names]
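
One way to keep the labels consistent across reports is to derive them from a measured generation rate instead of judging by feel. This is only a sketch: the tokens-per-second thresholds in `SPEED_BANDS` are made-up placeholders, and the original post does not say how its speed labels were assigned.

```python
# Hypothetical thresholds in tokens per second. Tune them once for your
# hardware and context size, then keep them fixed across reports.
SPEED_BANDS = [
    (40.0, "very fast"),
    (20.0, "fast"),
    (8.0, "normal"),
    (2.0, "slow"),
]


def speed_label(tokens_per_second):
    """Map a measured generation rate to one of the template's speed labels."""
    for threshold, label in SPEED_BANDS:
        if tokens_per_second >= threshold:
            return label
    return "too slow to test"


print(speed_label(25.0))
```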

❌ Models to Avoid

How to use this section: Include only models with clearly poor outcomes (very low score, repeated failures, instability, or unusable latency).

[Model] ([score]) - [reason]
[Model] ([score/status]) - [reason]

Test Details

How to use this section: Keep prompts, question set, and grading key stable between benchmark runs.

📋 TEST QUESTIONS

How to use this section: Replace with your exact test set. Keep IDs stable (S1..S15) so historical comparisons are easy.

GENERAL INTELLIGENCE (Logic & Reasoning)

S1. It is known that 5 machines produce 5 widgets in 5 minutes. How many minutes would it take for 100 machines to produce 100 widgets?

S2. Half of a lake surface is covered with water hyacinths. Every day, the covered area doubles. If it takes 48 days to completely cover the lake, how many days did it take to cover half of the lake?

S3. There are 3 fathers and 3 sons going to a doctor. What is the total number of people?

S4. Find the next number in the sequence: 2, 6, 12, 20, 30, 42, ?

S5. "Some doctors are surgeons. All surgeons are meticulous. Therefore, some doctors are meticulous." Is this inference valid?

CODING

S6. What does the following Python code return?

def mystery(lst):
    return [x**2 for x in lst if x % 2 == 0]

print(mystery([1, 2, 3, 4, 5, 6]))

S7. What is the output of the following JavaScript code?

const arr = [1, 2, 3];
const result = arr.reduce((acc, val) => acc + val, 10);
console.log(result);

S8. What is the most efficient approach to find the middle element of a linked list?

S9. What is the result of the following SQL query?

SELECT department, COUNT(*) as cnt
FROM employees
WHERE salary > 50000
GROUP BY department
HAVING COUNT(*) > 2
ORDER BY cnt DESC;

S10. When designing a REST API, which HTTP method and status code are correct for deleting a resource?

MATHEMATICS

S11. log₂(64) + log₂(8) = ?

S12. What is the derivative f'(x) of f(x) = 3x² + 2x − 1?

S13. A bag contains 3 red, 5 blue, and 2 green balls. If two balls are randomly selected, what is the probability that both are blue?

S14. Solve the equation: 3x − 7 = 5x + 1

S15. In the sequence where a₁ = 2 and aₙ = 2·aₙ₋₁ + 1, what is the value of a₄?

✅ ANSWER KEY

Question	Correct Answer
S1	5
S2	47
S3	4
S4	56
S5	Yes, valid
S6	[4, 16, 36]
S7	16
S8	Two pointers (tortoise and hare) — O(1) space
S9	Departments with >2 employees earning >50k, sorted descending
S10	DELETE + 204 No Content
S11	9
S12	6x + 2
S13	2/9
S14	x = −4
S15	23

Questions included: machine/widget ratio, exponential pond growth, father-son puzzle, sequence completion, syllogism, Python list comprehension, JS reduce, linked list middle, SQL aggregation, REST API, logarithms, derivatives, probability, linear equations, recurrence relations.
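
The coding answers in the key can be checked by running the snippets. The sketch below reruns S6 as written, reproduces S7 with Python's `functools.reduce` (a stand-in for the JavaScript `reduce`, not the original code), and executes the S9 query against an in-memory SQLite table. The `employees` rows are hypothetical; the question itself supplies no data, which is why the answer key describes the result rather than listing rows.

```python
import sqlite3
from functools import reduce


# S6: squares of the even elements.
def mystery(lst):
    return [x**2 for x in lst if x % 2 == 0]


s6 = mystery([1, 2, 3, 4, 5, 6])

# S7: Python equivalent of arr.reduce((acc, val) => acc + val, 10).
s7 = reduce(lambda acc, val: acc + val, [1, 2, 3], 10)

# S9: run the query against a small in-memory table.
# The rows are hypothetical; only the query text comes from the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, department TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [
        ("a", "eng", 90000), ("b", "eng", 80000),
        ("c", "eng", 70000), ("d", "eng", 60000),
        ("e", "ops", 55000), ("f", "ops", 65000), ("g", "ops", 75000),
        ("h", "ops", 40000),                      # removed by WHERE
        ("i", "hr", 60000), ("j", "hr", 70000),   # group removed by HAVING
    ],
)
s9 = conn.execute(
    """
    SELECT department, COUNT(*) AS cnt
    FROM employees
    WHERE salary > 50000
    GROUP BY department
    HAVING COUNT(*) > 2
    ORDER BY cnt DESC;
    """
).fetchall()
conn.close()

print(s6)  # [4, 16, 36]
print(s7)  # 16
print(s9)  # [('eng', 4), ('ops', 3)]
```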

12 Small LLMs Benchmarked on 15 Reasoning Questions (16384 ctx).md

🧠 12 Small LLMs Benchmarked on 15 Reasoning Questions (16384 ctx)

Test: 5 Logic + 5 Coding + 5 Math questions
Context: 16384
Setup: All models tested locally with identical prompts

🏆 Full Rankings (15 questions)

Rank	Model	Params	Score	Logic (5)	Code (5)	Math (5)	Speed
1	Qwen/Qwen3.6-35B-A3B (base)	35B MoE	14/15	4/5	5/5	5/5	fast
1	Qwen/Qwen3.5-9B-Claude-Opus-4.7	9B	14/15	4/5	5/5	5/5	slow
2	Qwen/Qwen3.5-4B-Claude-4.6-Opus-Reasoning-Distilled	4B	13/15	3/5	5/5	5/5	fast
3	Google/Gemma-4-E2B	~2-4B	12/15	3/5	4/5	5/5	normal
3	Nvidia/Nemotron-3-Nano-4B	4B	12/15	2/5	5/5	5/5	fast
3	OpenAI/GPT-OSS-20B	20B	12/15	2/5	5/5	5/5	slow
4	MistralAI/Ministral-3B	3B	11/15	3/5	5/5	3/5	very fast
5	Meta/Llama-3.1-8B-Instruct	8B	10/15	2/5	5/5	3/5	normal
5	lfm2.5-8B	8B	10/15	2/5	3/5	5/5	normal
6	IBM/Granite-4-H-Tiny	~2-4B	9/15	2/5	5/5	2/5	normal
6	Qwen/Qwen3.6-14B	14B	9/15	1/5	4/5	4/5	normal
7	Microsoft/Phi-4-mini-reasoning	~4B	5/15	0/5	2/5	1/5	normal
X	Negentropy/Negentropy-Claude-Opus-4.7-4B	4B	Crashed	-	-	-	failed
X	Google/Gemma4-12B	12B	Incomplete	-	-	-	very slow
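
The Rank column above appears to follow dense ranking: models with the same score share a rank, and the next distinct score gets the next integer. A minimal sketch of that scheme follows, using the scores of the completed runs from the table; the `dense_ranks` helper is illustrative and not taken from the original post.

```python
def dense_ranks(scores):
    """Rank 1 is the highest score; ties share a rank and no rank is skipped."""
    distinct = sorted(set(scores.values()), reverse=True)
    rank_of = {score: position for position, score in enumerate(distinct, start=1)}
    return {model: rank_of[score] for model, score in scores.items()}


# Scores out of 15, copied from the table above (completed runs only).
scores = {
    "Qwen/Qwen3.6-35B-A3B (base)": 14,
    "Qwen/Qwen3.5-9B-Claude-Opus-4.7": 14,
    "Qwen/Qwen3.5-4B-Claude-4.6-Opus-Reasoning-Distilled": 13,
    "Google/Gemma-4-E2B": 12,
    "Nvidia/Nemotron-3-Nano-4B": 12,
    "OpenAI/GPT-OSS-20B": 12,
    "MistralAI/Ministral-3B": 11,
    "Meta/Llama-3.1-8B-Instruct": 10,
    "lfm2.5-8B": 10,
    "IBM/Granite-4-H-Tiny": 9,
    "Qwen/Qwen3.6-14B": 9,
    "Microsoft/Phi-4-mini-reasoning": 5,
}

for model, rank in dense_ranks(scores).items():
    print(rank, model, f"{scores[model]}/15")
```

Crashed or incomplete runs have no score, so they stay outside this calculation and keep the X rank.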

🔥 Key Findings

1. Distillation is powerful but inconsistent

Qwen3.5-4B-Distilled: 13/15 (great)
Qwen3.6-35B-A3B-Claude-Apex: 11/15

2. A 4B model beat a 20B model

Qwen3.5-4B-Distilled (13/15) > GPT-OSS-20B (12/15)

3. Parameter efficiency champion (active params)

Model	Active	Score	Score/B
Qwen3.6-35B-A3B	3B	14	4.67
Ministral-3B	3B	11	3.67
Qwen3.5-4B-Distilled	4B	13	3.25

4. Hardest questions

S3 (father-son puzzle): 8/12 models failed
S1 (machine/widget ratio): 7/12 models failed
S2 (pond growth): 5/12 models failed

⚡ Speed Notes (16384 context)

Very fast: Ministral-3B
Fast: Qwen3.5-4B-Distilled, Nemotron-4B
Slow: Qwen3.5-9B-Claude, GPT-OSS-20B
Too slow to test: Gemma4-12B

❌ Models to Avoid

Phi-4-mini-reasoning (5/15) - poor reasoning despite name
Negentropy-4B - crashed on question 3
Gemma4-12B - too slow to use on an RTX 4050 -_-

Test Details

Tests run at 16384 context.

📋 TEST QUESTIONS (English)

GENERAL INTELLIGENCE (Logic & Reasoning)

S1. It is known that 5 machines produce 5 widgets in 5 minutes. How many minutes would it take for 100 machines to produce 100 widgets?

S2. Half of a lake surface is covered with water hyacinths. Every day, the covered area doubles. If it takes 48 days to completely cover the lake, how many days did it take to cover half of the lake?

S3. There are 3 fathers and 3 sons going to a doctor. What is the total number of people?

S4. Find the next number in the sequence: 2, 6, 12, 20, 30, 42, ?

S5. "Some doctors are surgeons. All surgeons are meticulous. Therefore, some doctors are meticulous." Is this inference valid?

CODING

S6. What does the following Python code return?

def mystery(lst):
    return [x**2 for x in lst if x % 2 == 0]

print(mystery([1, 2, 3, 4, 5, 6]))

S7. What is the output of the following JavaScript code?

const arr = [1, 2, 3];
const result = arr.reduce((acc, val) => acc + val, 10);
console.log(result);

S8. What is the most efficient approach to find the middle element of a linked list?

S9. What is the result of the following SQL query?

SELECT department, COUNT(*) as cnt
FROM employees
WHERE salary > 50000
GROUP BY department
HAVING COUNT(*) > 2
ORDER BY cnt DESC;

S10. When designing a REST API, which HTTP method and status code are correct for deleting a resource?

MATHEMATICS

S11. log₂(64) + log₂(8) = ?

S12. What is the derivative f'(x) of f(x) = 3x² + 2x − 1?

S13. A bag contains 3 red, 5 blue, and 2 green balls. If two balls are randomly selected, what is the probability that both are blue?

S14. Solve the equation: 3x − 7 = 5x + 1

S15. In the sequence where a₁ = 2 and aₙ = 2·aₙ₋₁ + 1, what is the value of a₄?

✅ ANSWER KEY

Question	Correct Answer
S1	5
S2	47
S3	4
S4	56
S5	Yes, valid
S6	[4, 16, 36]
S7	16
S8	Two pointers (tortoise and hare) — O(1) space
S9	Departments with >2 employees earning >50k, sorted descending
S10	DELETE + 204 No Content
S11	9
S12	6x + 2
S13	2/9
S14	x = −4
S15	23
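
The numeric entries in the key can be recomputed in a few lines. This is a sketch that checks S4 and S11 through S15, assuming S4 follows the pattern n(n+1) and S13 draws without replacement; variable names are arbitrary.

```python
import math
from fractions import Fraction

# S4: the terms 2, 6, 12, 20, 30, 42 are n * (n + 1) for n = 1..6, so n = 7 is next.
s4 = 7 * (7 + 1)

# S11: log2(64) + log2(8) = 6 + 3.
s11 = math.log2(64) + math.log2(8)


# S12: f(x) = 3x^2 + 2x - 1 has derivative 6x + 2.
# Spot-check the closed form against a central difference at x = 2.
def f(x):
    return 3 * x**2 + 2 * x - 1


def f_prime(x):
    return 6 * x + 2


h = 1e-6
numeric_slope = (f(2 + h) - f(2 - h)) / (2 * h)

# S13: 5 blue out of 10 balls, then 4 blue out of the remaining 9.
s13 = Fraction(5, 10) * Fraction(4, 9)

# S14: 3x - 7 = 5x + 1  ->  (3 - 5)x = 1 + 7  ->  x = 8 / -2.
s14 = Fraction(1 + 7, 3 - 5)

# S15: a1 = 2, then apply a_n = 2 * a_(n-1) + 1 three times to reach a4.
a = 2
for _ in range(3):
    a = 2 * a + 1
s15 = a

print(s4, s11, f_prime(2), s13, s14, s15)
```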

README.md

Local LLM Benchmark Templates

These templates are meant to help local users test and share how different models perform on different machines and configurations.

This is based on the following Reddit post I came across: https://www.reddit.com/r/LocalLLM/comments/1u2decq/i_tested_12_small_llms_1b35b_on_a_15question/

Files:

12 Small LLMs Benchmarked on 15 Reasoning Questions (16384 ctx).md - The neatly formatted results from the aforementioned post.
12 Small LLMs Benchmark Template.md - This is a template based on that post with the results removed, but the test details intact.
Local LLM Benchmark Template.md - This is a more generic but neatly formatted template that does not include the test questions but instead provides structure to create new tests.

Details

Reference these files, run your own tests or create new ones, and share the results with the world.

Local LLM Benchmark Template.md

🧠 12 Small LLMs Benchmarked on 15 Reasoning Questions (16384 ctx) — Template

How to use this section: Fill in the run metadata before testing so future comparisons are apples-to-apples.

🏆 Full Rankings (15 questions)

How to use this section: Add one row per model. Use X rank for failed/incomplete runs, and explain in Notes.

Rank	Model	Params	Score	Logic (5)	Code (5)	Math (5)	Speed	Notes
[1]	[Model Name]	[e.g., 4B]	[__/15]	[__/5]	[__/5]	[__/5]	[very fast/fast/normal/slow]	[optional]
[2]	[Model Name]	[e.g., 8B]	[__/15]	[__/5]	[__/5]	[__/5]	[speed]	[optional]
[X]	[Model Name]	[e.g., 12B]	[Crashed/Incomplete]	[-]	[-]	[-]	[failed/very slow]	[what happened]

🔥 Key Findings

How to use this section: Summarize the top 3–5 takeaways from this run only. Keep each finding concise and evidence-based.

1. [Finding title]

[Observation]
[Supporting comparison]

2. [Finding title]

[Observation]
[Supporting comparison]

3. [Finding title]

[Observation]
[Supporting comparison]

4. Parameter efficiency (optional)

How to use this subsection: Include only if active parameter data is known.

Model	Active Params	Score	Score/B
[Model A]	[e.g., 3B]	[__]	[__]
[Model B]	[e.g., 4B]	[__]	[__]

5. Hardest questions

How to use this subsection: List the questions most frequently missed and how many models failed each.

[Question ID + short label]: [x/y] models failed
[Question ID + short label]: [x/y] models failed
[Question ID + short label]: [x/y] models failed

⚡ Speed Notes

How to use this section: Group models by practical inference speed at this context size. Keep labels consistent across reports.

Very fast: [model names]
Fast: [model names]
Normal: [model names]
Slow: [model names]
Too slow to test: [model names]

❌ Models to Avoid

How to use this section: Include only models with clearly poor outcomes (very low score, repeated failures, instability, or unusable latency).

[Model] ([score]) - [reason]
[Model] ([score/status]) - [reason]

Test Details

How to use this section: Keep prompts, question set, and grading key stable between benchmark runs.

📋 TEST QUESTIONS

How to use this section: Replace with your exact test set. Keep IDs stable (S1..S15) so historical comparisons are easy.

GENERAL INTELLIGENCE (Logic & Reasoning)

S1. [Question text]

S2. [Question text]

S3. [Question text]

S4. [Question text]

S5. [Question text]

CODING

S6. [Question text]

# Optional code snippet for S6

S7. [Question text]

// Optional code snippet for S7

S8. [Question text]

S9. [Question text]

-- Optional code snippet for S9

S10. [Question text]

MATHEMATICS

S11. [Question text]

S12. [Question text]

S13. [Question text]

S14. [Question text]

S15. [Question text]

✅ ANSWER KEY

How to use this section: Keep answers deterministic when possible. Update this first before scoring models.

Question	Correct Answer
S1	[answer]
S2	[answer]
S3	[answer]
S4	[answer]
S5	[answer]
S6	[answer]
S7	[answer]
S8	[answer]
S9	[answer]
S10	[answer]
S11	[answer]
S12	[answer]
S13	[answer]
S14	[answer]
S15	[answer]

Scoring Notes (optional)

How to use this section: Define grading rules once and reuse them every run.

[e.g., exact match required for numeric answers]
[e.g., partial credit policy if used]
[e.g., timeout/failure handling]
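
Grading rules like these are easier to reuse when they live in code. The following is a minimal sketch, assuming an exact-match policy after whitespace and case normalization, no partial credit, and a missing answer (for example after a timeout) scored as wrong. The `normalize` and `grade` helpers and the sample responses are hypothetical.

```python
def normalize(answer):
    """Lowercase, trim, and collapse internal whitespace before comparing."""
    return " ".join(str(answer).strip().lower().split())


def grade(responses, answer_key):
    """Return (total_score, per_question) for one model.

    A question counts as correct only on an exact match after normalization.
    Questions with no recorded response (timeout, crash) count as wrong.
    """
    per_question = {}
    for qid, expected in answer_key.items():
        given = responses.get(qid)
        per_question[qid] = given is not None and normalize(given) == normalize(expected)
    return sum(per_question.values()), per_question


# Hypothetical example: three questions, one correct, one wrong, one missing.
answer_key = {"S1": "5", "S2": "47", "S3": "4"}
responses = {"S1": " 5 ", "S2": "48"}

score, detail = grade(responses, answer_key)
print(f"{score}/{len(answer_key)}", detail)
```

Free-form answers such as explanations or approach descriptions will not survive exact matching, so they still need manual review or a separate rule.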