Consolidated LLM benchmarks, including Opus 4.7, Qwen 3.5 & more - 2026-04-16

Opus 4.7 and Qwen3.6-35B-A3B Benchmark Comparison

Original Qwen3.6-35B-A3B benchmark table, extended to include scores for Claude Opus 4.7, Claude Opus 4.6, Claude Sonnet 4.6, gpt-5.4-high, and gemini-3.1-pro-preview. Empty cells indicate scores not published by the vendor or not applicable. Cross-vendor numbers are not directly comparable because each vendor uses a different evaluation harness; see Methodology Flags at the bottom.

Coding Agent

Benchmark Opus 4.7 Qwen3.5-27B Qwen3.5-35B-A3B Gemma4-31B Gemma4-26BA4B Qwen3.6-35B-A3B Opus 4.6 Sonnet 4.6 gpt-5.4-high gemini-3.1-pro-preview
SWE-bench Verified 87.6 75.0 70.0 52.0 17.4 73.4 80.8 79.6 77.2* 80.6
SWE-bench Multilingual 69.3 60.3 51.7 17.3 67.2 77.8
SWE-bench Pro 64.3 51.2 44.6 35.7 13.8 49.5 53.4 57.7 54.2
Terminal-Bench 2.0 69.4 41.6 40.5 42.9 34.2 51.5 65.4 59.1 75.1† 68.5
Claw-Eval Avg 64.3 65.4 48.5 58.8 68.7
Claw-Eval Pass^3 46.2 51.0 25.0 28.0 50.0
SkillsBench Avg5 27.2 4.4 23.6 12.3 28.7
QwenClawBench 52.2 47.7 41.7 38.7 52.6
NL2Repo 27.3 20.5 15.5 11.6 29.4
QwenWebBench 1068 978 1197 1178 1397

* gpt-5.4-high SWE-bench Verified: OpenAI did not publish a Verified number in the 5.4 launch (they emphasized SWE-bench Pro). 77.2% is an independent Vals.ai run, not vendor-reported.

† gpt-5.4-high Terminal-Bench 2.0: self-reported with OpenAI's own harness, not Terminus-2. The independent tbench.ai ForgeCode run with Terminus-KIRA shows 74.7%.

General Agent

Benchmark Opus 4.7 Qwen3.5-27B Qwen3.5-35B-A3B Gemma4-31B Gemma4-26BA4B Qwen3.6-35B-A3B Opus 4.6 Sonnet 4.6 gpt-5.4-high gemini-3.1-pro-preview
TAU3-Bench 68.4 68.9 67.5 59.0 67.2
VITA-Bench 41.8 29.1 43.0 36.9 35.6
DeepPlanning 22.6 22.8 24.0 16.2 25.9
Tool Decathlon 31.5 28.7 21.2 12.0 26.9
MCPMark 36.3 27.0 18.1 14.2 37.0
MCP-Atlas 77.3 68.4 62.4 57.2 50.0 62.8 75.8 61.3 68.1 69.2
WideSearch 66.4 59.1 35.2 38.3 60.1

Knowledge

Benchmark Opus 4.7 Qwen3.5-27B Qwen3.5-35B-A3B Gemma4-31B Gemma4-26BA4B Qwen3.6-35B-A3B Opus 4.6 Sonnet 4.6 gpt-5.4-high gemini-3.1-pro-preview
MMLU-Pro 86.1 85.3 85.2 82.6 85.2 79.1
MMLU-Redux 93.2 93.3 93.7 92.7 93.3
SuperGPQA 65.6 63.4 65.7 61.4 64.7
C-Eval 90.5 90.2 82.6 82.5 90.0

STEM & Reasoning

Benchmark Opus 4.7 Qwen3.5-27B Qwen3.5-35B-A3B Gemma4-31B Gemma4-26BA4B Qwen3.6-35B-A3B Opus 4.6 Sonnet 4.6 gpt-5.4-high gemini-3.1-pro-preview
GPQA 94.2 85.5 84.2 84.3 82.3 86.0 91.3 74.1 92.8 94.3
HLE 46.9 24.3 22.4 19.5 8.7 21.4 40.0 39.8 44.4
HLE (with tools) 54.7 53.0 52.1 51.4
LiveCodeBench v6 80.7 74.6 80.0 77.1 80.4
HMMT Feb 25 92.0 89.0 88.7 91.7 90.7
HMMT Nov 25 89.8 89.2 87.5 87.5 89.1
HMMT Feb 26 84.3 78.7 77.2 79.0 83.6
IMOAnswerBench 79.9 76.8 74.5 74.3 78.9
AIME26 92.6 91.0 89.2 88.3 92.7

Vision Language

Benchmark Opus 4.7 Qwen3.5-27B Qwen3.5-35B-A3B Gemma4-31B Gemma4-26BA4B Qwen3.6-35B-A3B Claude-Sonnet-4.5 Opus 4.6 Sonnet 4.6 gpt-5.4-high gemini-3.1-pro-preview
STEM and Puzzle
MMMU 82.3 81.4 80.4 78.4 81.7 79.6
MMMU-Pro 75.0 75.1 76.9 73.8 75.3 68.4 73.9 80.5
Mathvista(mini) 87.8 86.2 79.3 79.4 86.4 79.8
ZEROBench_sub 36.2 34.1 26.0 26.3 34.4 26.3
General VQA
RealWorldQA 83.7 84.1 72.3 72.2 85.3 70.3
MMBenchEN-DEV-v1.1 92.6 91.5 90.9 89.0 92.8 88.3
SimpleVQA 56.0 58.3 52.9 52.2 58.9 57.6
HallusionBench 70.0 67.9 67.4 66.1 69.8 59.9
Text Recognition and Document Understanding
OmniDocBench1.5 88.9 89.3 80.1 74.4 89.9 85.8
CharXiv(RQ) 82.1 79.5 77.5 67.9 69.0 78.0 67.2 68.7
CC-OCR 81.0 80.7 75.7 74.5 81.9 68.1
AI2D_TEST 92.9 92.6 89.0 88.3 92.7 87.0
Spatial Intelligence
RefCOCO(avg) 90.9 89.2 92.0
ODInW13 41.1 42.6 50.8
EmbSpatialBench 84.5 83.1 84.3 71.8
RefSpatialBench 67.7 63.5 64.3
Video Understanding
VideoMME(w sub.) 87.0 86.6 86.6 81.1
VideoMME(w/o sub.) 82.8 82.5 82.5 75.3
VideoMMMU 82.3 80.4 81.6 76.0 83.7 77.6
MLVU 85.9 85.6 86.2 72.8
MVBench 74.6 74.8 74.6
LVBench 73.6 71.4 71.4

Additional frontier-model benchmarks (not in original Qwen table)

These benchmarks are commonly reported across the five frontier models but were not part of Qwen's original comparison:

Benchmark Opus 4.7 Opus 4.6 Sonnet 4.6 gpt-5.4-high gemini-3.1-pro-preview
OSWorld-Verified 78.0 72.7 72.5 75.0
BrowseComp 79.3 84.0 89.3‡ 85.9
τ²-bench Retail 91.9 91.7 90.8
τ²-bench Telecom 97.9 97.9 99.3
MMMLU 91.5 91.1 92.6

‡ BrowseComp 89.3 is for GPT-5.4 Pro, not GPT-5.4-high.

arena.ai Elo rankings (April 16, 2026)

Elo ratings, not percentage scores. Opus 4.7 was released on April 16, 2026 and is not yet ranked.

Arena Opus 4.7 Opus 4.6 Sonnet 4.6 gpt-5.4-high gemini-3.1-pro-preview
Text not yet ranked 1496 1481 1493
Code not yet ranked 1545 1524 1457 1454
Vision not yet ranked 1293 1274 1277
Document not yet ranked 1515 1500 1484 1450

Methodology Flags

  1. SWE-bench Verified harness differs by vendor: Anthropic averages over 25 trials with optional prompt modification; OpenAI uses a fixed n=477 subset on internal infra (and did not report Verified for GPT-5.4); Google uses its own bash+file-ops+submit scaffold with a +0.6% buggy-item adjustment; Qwen uses an internal agent scaffold with bash + file-edit tools. Numbers across these columns are not apples-to-apples.

  2. Terminal-Bench 2.0 harness mismatch: Anthropic and Google both use the Terminus-2 harness (comparable). OpenAI's 75.1% is self-reported with a different harness and is flagged as non-comparable. Independent tbench.ai ForgeCode run for GPT-5.4: 74.7%.

  3. Reasoning-effort settings: gpt-5.4-high = reasoning_effort="high". Opus 4.7 introduced a new "xhigh" level; Claude Code defaults to xhigh. Gemini 3.1 Pro scores are all "Thinking High". Qwen scores use temp=1.0 / top_p=0.95 per their table footnotes. A minimal request sketch showing where these settings live follows this list.

  4. Claude extended thinking: Opus 4.7 moved to adaptive thinking + effort levels. HLE splits "no tools" vs "with tools" (web search + fetch + code exec + programmatic tool calls, max effort).

  5. GPT-5.4 Pro vs GPT-5.4-high: BrowseComp 89.3%, HLE with-tools 58.7%, and GPQA 94.4% were published for GPT-5.4 Pro, a different model tier than gpt-5.4-high. The table uses GPT-5.4-high numbers where available.

  6. Opus 4.6 retroactive revisions: Anthropic revised MCP-Atlas (to 75.8% per Scale AI regrade), CyberGym (66.6% → 73.8%), and HLE with-tools (53.1% → 53.0%) after initial February 5 publication. The April 16 Opus 4.7 announcement chart is the canonical source for 4.6 numbers.

  7. Qwen-proprietary benchmarks: Claw-Eval, SkillsBench, QwenClawBench, NL2Repo, QwenWebBench, VITA-Bench, and WideSearch are Qwen-internal or Qwen-customized benchmarks. No frontier model vendor has published scores on these.

  8. Benchmark saturation at the frontier: GPQA Diamond, MMLU-Pro, and AIME 2025 are saturated; differences like 94.2 vs. 94.3 vs. 94.4 are noise. Real differentiation is in SWE-bench Pro, HLE, MCP-Atlas, BrowseComp, and Terminal-Bench 2.0.
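
As a reference for flag 3, here is a minimal sketch of where the reasoning-effort and thinking settings sit in an API request. The model ids are placeholders copied from the tables (not confirmed API identifiers), and the parameter shape for Opus 4.7's new "xhigh" effort level has not been published, so the Anthropic call below uses the existing extended-thinking request shape with a token budget instead.

```python
# Hedged sketch only: parameter names below exist in the current OpenAI and
# Anthropic Python SDKs; the model ids are placeholders from the tables.
from openai import OpenAI
from anthropic import Anthropic

# "gpt-5.4-high" in the tables means reasoning effort set to "high".
openai_client = OpenAI()
gpt_resp = openai_client.responses.create(
    model="gpt-5.4",                      # placeholder id, not confirmed
    reasoning={"effort": "high"},         # the setting flag 3 refers to
    input="Explain the difference between SWE-bench Verified and SWE-bench Pro.",
)

# Opus 4.7's "xhigh" effort level is not shown here; this uses today's
# extended-thinking parameters with an explicit token budget.
anthropic_client = Anthropic()
claude_resp = anthropic_client.messages.create(
    model="claude-opus-4-7",              # placeholder id, not confirmed
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 8000},
    messages=[{"role": "user", "content": "Same question as above."}],
)
```

Gemini's "Thinking High" setting has its own API knob that is not sketched here; the Qwen sampling settings are sketched after the footnotes section below.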

Footnotes from Qwen's original table

  • SWE-Bench Series: Internal agent scaffold (bash + file-edit tools); temp=1.0, top_p=0.95, 200K context window. Qwen corrected some problematic tasks in the public SWE-bench Pro set and evaluated all baselines on the refined benchmark.
  • Terminal-Bench 2.0: Harbor/Terminus-2 harness; 3h timeout, 32 CPU/48 GB RAM; temp=1.0, top_p=0.95, top_k=20, max_tokens=80K, 256K ctx; avg of 5 runs (these sampling settings are sketched as an example request after this list).
  • SkillsBench: Evaluated via OpenCode on 78 tasks (self-contained subset, excluding API-dependent tasks); avg of 5 runs.
  • NL2Repo: Other models are evaluated via Claude Code (temp=1.0, top_p=0.95, max_turns=900).
  • QwenClawBench: Internal real-user-distribution Claw agent benchmark (open-sourcing soon); temp=0.6, 256K ctx.
  • QwenWebBench: Internal front-end code generation benchmark; bilingual (EN/CN), 7 categories (Web Design, Web Apps, Games, SVG, Data Visualization, Animation, 3D); auto-render + multimodal judge; BT/Elo rating system.
  • TAU3-Bench: Official user model (gpt-5.2, low reasoning effort) + default BM25 retrieval.
  • VITA-Bench: Avg subdomain scores; claude-4-sonnet used as the judge (the official judge, claude-3.7-sonnet, is no longer available).
  • MCPMark: GitHub MCP v0.30.3; Playwright responses truncated at 32K tokens.
  • MCP-Atlas: Public set score; gemini-2.5-pro as the judge.
  • AIME 26: Full AIME 2026 (I & II); scores may differ from Qwen 3.5 notes.
  • MMMU-Pro (Gemma4-31B, Gemma4-26BA4B): the original table marks these scores with an asterisk indicating a special evaluation condition.
  • Vision language (Qwen3.6): Qwen reports that Qwen3.6-35B-A3B's vision-language performance matches Claude Sonnet 4.5, with strengths in spatial intelligence (92.0 on RefCOCO, 50.8 on ODInW13).
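
To make the sampling settings in these footnotes concrete, here is a minimal sketch of an OpenAI-compatible chat request that carries them. The base URL, API key, model id, and the use of extra_body for top_k are all assumptions (for example, a self-hosted vLLM-style server); Qwen's actual evaluation harness is internal and not published.

```python
# Hedged sketch, not Qwen's harness: forwards the sampling settings from the
# Terminal-Bench 2.0 footnote to an assumed OpenAI-compatible endpoint.
from openai import OpenAI

# Assumed local server exposing an OpenAI-compatible API (e.g. vLLM).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="Qwen3.6-35B-A3B",          # placeholder model id
    messages=[{"role": "user", "content": "Fix the failing test in this repo."}],
    temperature=1.0,                  # temp=1.0 per the footnotes
    top_p=0.95,
    max_tokens=80_000,                # Terminal-Bench footnote caps generation at 80K tokens
    extra_body={"top_k": 20},         # top_k is not a standard OpenAI param; many servers accept it via extra_body
)
print(resp.choices[0].message.content)
```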