Original Qwen3.6-35B-A3B benchmark table extended to include scores for Claude Opus 4.7, Claude Opus 4.6, Claude Sonnet 4.6, gpt-5.4-high, and gemini-3.1-pro-preview. Empty cells (—) indicate scores not published by the vendor or not applicable. Cross-vendor numbers are not directly comparable because each vendor uses different evaluation harnesses — see Methodology Flags at the bottom.
| Benchmark | Opus 4.7 | Qwen3.5-27B | Qwen3.5-35B-A3B | Gemma4-31B | Gemma4-26BA4B | Qwen3.6-35B-A3B | Opus 4.6 | Sonnet 4.6 | gpt-5.4-high | gemini-3.1-pro-preview |
|---|---|---|---|---|---|---|---|---|---|---|
| SWE-bench Verified | 87.6 | 75.0 | 70.0 | 52.0 | 17.4 | 73.4 | 80.8 | 79.6 | 77.2* | 80.6 |
| SWE-bench Multilingual | — | 69.3 | 60.3 | 51.7 | 17.3 | 67.2 | 77.8 | — | — | — |
| SWE-bench Pro | 64.3 | 51.2 | 44.6 | 35.7 | 13.8 | 49.5 | 53.4 | — | 57.7 | 54.2 |
| Terminal-Bench 2.0 | 69.4 | 41.6 | 40.5 | 42.9 | 34.2 | 51.5 | 65.4 | 59.1 | 75.1† | 68.5 |
| Claw-Eval Avg | — | 64.3 | 65.4 | 48.5 | 58.8 | 68.7 | — | — | — | — |
| Claw-Eval Pass^3 | — | 46.2 | 51.0 | 25.0 | 28.0 | 50.0 | — | — | — | — |
| SkillsBench Avg5 | — | 27.2 | 4.4 | 23.6 | 12.3 | 28.7 | — | — | — | — |
| QwenClawBench | — | 52.2 | 47.7 | 41.7 | 38.7 | 52.6 | — | — | — | — |
| NL2Repo | — | 27.3 | 20.5 | 15.5 | 11.6 | 29.4 | — | — | — | — |
| QwenWebBench | — | 1068 | 978 | 1197 | 1178 | 1397 | — | — | — | — |
* gpt-5.4-high SWE-bench Verified: OpenAI did not publish a Verified number in the 5.4 launch (they emphasized SWE-bench Pro); 77.2% is an independent Vals.ai run, not vendor-reported.
† gpt-5.4-high Terminal-Bench 2.0: self-reported with OpenAI's own harness, not Terminus-2. The independent tbench.ai ForgeCode run with Terminus-KIRA shows 74.7%.
| Benchmark | Opus 4.7 | Qwen3.5-27B | Qwen3.5-35B-A3B | Gemma4-31B | Gemma4-26BA4B | Qwen3.6-35B-A3B | Opus 4.6 | Sonnet 4.6 | gpt-5.4-high | gemini-3.1-pro-preview |
|---|---|---|---|---|---|---|---|---|---|---|
| TAU3-Bench | — | 68.4 | 68.9 | 67.5 | 59.0 | 67.2 | — | — | — | — |
| VITA-Bench | — | 41.8 | 29.1 | 43.0 | 36.9 | 35.6 | — | — | — | — |
| DeepPlanning | — | 22.6 | 22.8 | 24.0 | 16.2 | 25.9 | — | — | — | — |
| Tool Decathlon | — | 31.5 | 28.7 | 21.2 | 12.0 | 26.9 | — | — | — | — |
| MCPMark | — | 36.3 | 27.0 | 18.1 | 14.2 | 37.0 | — | — | — | — |
| MCP-Atlas | 77.3 | 68.4 | 62.4 | 57.2 | 50.0 | 62.8 | 75.8 | 61.3 | 68.1 | 69.2 |
| WideSearch | — | 66.4 | 59.1 | 35.2 | 38.3 | 60.1 | — | — | — | — |
| Benchmark | Opus 4.7 | Qwen3.5-27B | Qwen3.5-35B-A3B | Gemma4-31B | Gemma4-26BA4B | Qwen3.6-35B-A3B | Opus 4.6 | Sonnet 4.6 | gpt-5.4-high | gemini-3.1-pro-preview |
|---|---|---|---|---|---|---|---|---|---|---|
| MMLU-Pro | — | 86.1 | 85.3 | 85.2 | 82.6 | 85.2 | — | 79.1 | — | — |
| MMLU-Redux | — | 93.2 | 93.3 | 93.7 | 92.7 | 93.3 | — | — | — | — |
| SuperGPQA | — | 65.6 | 63.4 | 65.7 | 61.4 | 64.7 | — | — | — | — |
| C-Eval | — | 90.5 | 90.2 | 82.6 | 82.5 | 90.0 | — | — | — | — |
| Benchmark | Opus 4.7 | Qwen3.5-27B | Qwen3.5-35B-A3B | Gemma4-31B | Gemma4-26BA4B | Qwen3.6-35B-A3B | Opus 4.6 | Sonnet 4.6 | gpt-5.4-high | gemini-3.1-pro-preview |
|---|---|---|---|---|---|---|---|---|---|---|
| GPQA | 94.2 | 85.5 | 84.2 | 84.3 | 82.3 | 86.0 | 91.3 | 74.1 | 92.8 | 94.3 |
| HLE | 46.9 | 24.3 | 22.4 | 19.5 | 8.7 | 21.4 | 40.0 | — | 39.8 | 44.4 |
| HLE (with tools) | 54.7 | — | — | — | — | — | 53.0 | — | 52.1 | 51.4 |
| LiveCodeBench v6 | — | 80.7 | 74.6 | 80.0 | 77.1 | 80.4 | — | — | — | — |
| HMMT Feb 25 | — | 92.0 | 89.0 | 88.7 | 91.7 | 90.7 | — | — | — | — |
| HMMT Nov 25 | — | 89.8 | 89.2 | 87.5 | 87.5 | 89.1 | — | — | — | — |
| HMMT Feb 26 | — | 84.3 | 78.7 | 77.2 | 79.0 | 83.6 | — | — | — | — |
| IMOAnswerBench | — | 79.9 | 76.8 | 74.5 | 74.3 | 78.9 | — | — | — | — |
| AIME26 | — | 92.6 | 91.0 | 89.2 | 88.3 | 92.7 | — | — | — | — |
| Benchmark | Opus 4.7 | Qwen3.5-27B | Qwen3.5-35B-A3B | Gemma4-31B | Gemma4-26BA4B | Qwen3.6-35B-A3B | Claude-Sonnet-4.5 | Opus 4.6 | Sonnet 4.6 | gpt-5.4-high | gemini-3.1-pro-preview |
|---|---|---|---|---|---|---|---|---|---|---|---|
| STEM and Puzzle | |||||||||||
| MMMU | — | 82.3 | 81.4 | 80.4 | 78.4 | 81.7 | 79.6 | — | — | — | — |
| MMMU-Pro | — | 75.0 | 75.1 | 76.9 | 73.8 | 75.3 | 68.4 | 73.9 | — | — | 80.5 |
| Mathvista(mini) | — | 87.8 | 86.2 | 79.3 | 79.4 | 86.4 | 79.8 | — | — | — | — |
| ZEROBench_sub | — | 36.2 | 34.1 | 26.0 | 26.3 | 34.4 | 26.3 | — | — | — | — |
| General VQA | |||||||||||
| RealWorldQA | — | 83.7 | 84.1 | 72.3 | 72.2 | 85.3 | 70.3 | — | — | — | — |
| MMBenchEN-DEV-v1.1 | — | 92.6 | 91.5 | 90.9 | 89.0 | 92.8 | 88.3 | — | — | — | — |
| SimpleVQA | — | 56.0 | 58.3 | 52.9 | 52.2 | 58.9 | 57.6 | — | — | — | — |
| HallusionBench | — | 70.0 | 67.9 | 67.4 | 66.1 | 69.8 | 59.9 | — | — | — | — |
| Text Recognition and Document Understanding | |||||||||||
| OmniDocBench1.5 | — | 88.9 | 89.3 | 80.1 | 74.4 | 89.9 | 85.8 | — | — | — | — |
| CharXiv(RQ) | 82.1 | 79.5 | 77.5 | 67.9 | 69.0 | 78.0 | 67.2 | 68.7 | — | — | — |
| CC-OCR | — | 81.0 | 80.7 | 75.7 | 74.5 | 81.9 | 68.1 | — | — | — | — |
| AI2D_TEST | — | 92.9 | 92.6 | 89.0 | 88.3 | 92.7 | 87.0 | — | — | — | — |
| Spatial Intelligence | |||||||||||
| RefCOCO(avg) | — | 90.9 | 89.2 | — | — | 92.0 | — | — | — | — | — |
| ODInW13 | — | 41.1 | 42.6 | — | — | 50.8 | — | — | — | — | — |
| EmbSpatialBench | — | 84.5 | 83.1 | — | — | 84.3 | 71.8 | — | — | — | — |
| RefSpatialBench | — | 67.7 | 63.5 | — | — | 64.3 | — | — | — | — | — |
| Video Understanding | |||||||||||
| VideoMME(w sub.) | — | 87.0 | 86.6 | — | — | 86.6 | 81.1 | — | — | — | — |
| VideoMME(w/o sub.) | — | 82.8 | 82.5 | — | — | 82.5 | 75.3 | — | — | — | — |
| VideoMMMU | — | 82.3 | 80.4 | 81.6 | 76.0 | 83.7 | 77.6 | — | — | — | — |
| MLVU | — | 85.9 | 85.6 | — | — | 86.2 | 72.8 | — | — | — | — |
| MVBench | — | 74.6 | 74.8 | — | — | 74.6 | — | — | — | — | — |
| LVBench | — | 73.6 | 71.4 | — | — | 71.4 | — | — | — | — | — |
These benchmarks are commonly reported across the five frontier models but were not part of Qwen's original comparison:
| Benchmark | Opus 4.7 | Opus 4.6 | Sonnet 4.6 | gpt-5.4-high | gemini-3.1-pro-preview |
|---|---|---|---|---|---|
| OSWorld-Verified | 78.0 | 72.7 | 72.5 | 75.0 | — |
| BrowseComp | 79.3 | 84.0 | — | 89.3‡ | 85.9 |
| τ²-bench Retail | — | 91.9 | 91.7 | — | 90.8 |
| τ²-bench Telecom | — | 97.9 | 97.9 | — | 99.3 |
| MMMLU | 91.5 | 91.1 | — | — | 92.6 |
‡ BrowseComp 89.3 is for GPT-5.4 Pro, not GPT-5.4-high.
The arena scores below are Elo ratings, not percentage scores. Opus 4.7 was released today and is not yet ranked; a sketch converting Elo gaps to head-to-head win probability follows the table.
| Arena | Opus 4.7 | Opus 4.6 | Sonnet 4.6 | gpt-5.4-high | gemini-3.1-pro-preview |
|---|---|---|---|---|---|
| Text | not yet ranked | 1496 | — | 1481 | 1493 |
| Code | not yet ranked | 1545 | 1524 | 1457 | 1454 |
| Vision | not yet ranked | 1293 | 1274 | — | 1277 |
| Document | not yet ranked | 1515 | 1500 | 1484 | 1450 |
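To read the Elo gaps above, here is a minimal sketch using the standard Elo expected-score formula; treating the arena ratings as plain Elo on a 400-point logistic scale is an assumption, since the arena's exact rating model is not specified here.

```python
def elo_win_probability(rating_a: float, rating_b: float) -> float:
    """Expected head-to-head win probability of A over B under standard Elo
    with a 400-point logistic scale."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

# Worked examples from the arena table above (scale assumption noted in the lead-in):
print(f"Text: {elo_win_probability(1496, 1481):.3f}")  # Opus 4.6 vs gpt-5.4-high -> ~0.52
print(f"Code: {elo_win_probability(1545, 1457):.3f}")  # Opus 4.6 vs gpt-5.4-high -> ~0.62
```

On this reading, the 88-point Code gap corresponds to roughly a 62/38 head-to-head split, while the 15-point Text gap is close to a coin flip.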
- SWE-bench Verified harness differs by vendor: Anthropic averages over 25 trials with optional prompt modification; OpenAI uses a fixed n=477 subset on internal infra (and did not report Verified for GPT-5.4); Google uses its own bash+file-ops+submit scaffold with a +0.6% buggy-item adjustment; Qwen uses an internal agent scaffold with bash + file-edit tools. Numbers across these columns are not apples-to-apples.
- Terminal-Bench 2.0 harness mismatch: Anthropic and Google both use the Terminus-2 harness (comparable). OpenAI's 75.1% is self-reported with a different harness and is flagged as non-comparable. Independent tbench.ai ForgeCode run for GPT-5.4: 74.7%.
- Reasoning-effort settings: gpt-5.4-high = reasoning_effort="high". Opus 4.7 introduced a new "xhigh" level; Claude Code defaults to xhigh. Gemini 3.1 Pro scores are all "Thinking High". Qwen scores use temp=1.0 / top_p=0.95 per their table footnotes.
- Claude extended thinking: Opus 4.7 moved to adaptive thinking + effort levels. HLE splits "no tools" vs "with tools" (web search + fetch + code exec + programmatic tool calls, max effort).
- GPT-5.4 Pro vs GPT-5.4-high: BrowseComp 89.3%, HLE with-tools 58.7%, and GPQA 94.4% were published for GPT-5.4 Pro, a different model tier than gpt-5.4-high. The table uses GPT-5.4-high numbers where available.
- Opus 4.6 retroactive revisions: Anthropic revised MCP-Atlas (to 75.8% per a Scale AI regrade), CyberGym (66.6% → 73.8%), and HLE with-tools (53.1% → 53.0%) after the initial February 5 publication. The April 16 Opus 4.7 announcement chart is the canonical source for 4.6 numbers.
- Qwen-proprietary benchmarks: Claw-Eval, SkillsBench, QwenClawBench, NL2Repo, QwenWebBench, VITA-Bench, and WideSearch are Qwen-internal or Qwen-customized benchmarks. No frontier model vendor has published scores on these.
- Benchmark saturation at the frontier: GPQA Diamond, MMLU-Pro, and AIME 2025 are saturated; 94.2 / 94.3 / 94.4 differences are noise (a quick arithmetic check follows this list). Real differentiation is in SWE-bench Pro, HLE, MCP-Atlas, BrowseComp, and Terminal-Bench 2.0.
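To put numbers on that noise claim, here is a minimal sketch that treats a benchmark score as a binomial proportion; the 198-question size assumed for GPQA Diamond and the independence assumption are mine, not from the table.

```python
import math

def binomial_stderr(score_pct: float, n_questions: int) -> float:
    """One standard error (in percentage points) of an accuracy score, treating
    each question as an independent Bernoulli trial."""
    p = score_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_questions)

# Assumption: GPQA Diamond has 198 questions; adjust if a vendor evaluates a subset.
for model, score in [("Opus 4.7", 94.2), ("gemini-3.1-pro-preview", 94.3), ("GPT-5.4 Pro", 94.4)]:
    print(f"{model}: {score} ± {binomial_stderr(score, 198):.1f} pp")
# Each score carries roughly ±1.6-1.7 pp of sampling noise, an order of magnitude larger
# than the 0.1 pp gaps separating these models, so the ordering is not meaningful.
```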
Per-benchmark evaluation notes from Qwen's original table footnotes:
- SWE-Bench Series: Internal agent scaffold (bash + file-edit tools); temp=1.0, top_p=0.95, 200K context window. Qwen corrected some problematic tasks in the public SWE-bench Pro set and evaluated all baselines on the refined benchmark.
- Terminal-Bench 2.0: Harbor/Terminus-2 harness; 3h timeout, 32 CPU/48 GB RAM; temp=1.0, top_p=0.95, top_k=20, max_tokens=80K, 256K ctx; avg of 5 runs (see the metric sketch after these notes).
- SkillsBench: Evaluated via OpenCode on 78 tasks (self-contained subset, excluding API-dependent tasks); avg of 5 runs.
- NL2Repo: Other models are evaluated via Claude Code (temp=1.0, top_p=0.95, max_turns=900).
- QwenClawBench: Internal real-user-distribution Claw agent benchmark (open-sourcing soon); temp=0.6, 256K ctx.
- QwenWebBench: Internal front-end code generation benchmark; bilingual (EN/CN), 7 categories (Web Design, Web Apps, Games, SVG, Data Visualization, Animation, 3D); auto-render + multimodal judge; BT/Elo rating system.
- TAU3-Bench: Official user model (gpt-5.2, low reasoning effort) + default BM25 retrieval.
- VITA-Bench: Average of subdomain scores; uses claude-4-sonnet as the judge (the official judge, claude-3.7-sonnet, is no longer available).
- MCPMark: GitHub MCP v0.30.3; Playwright responses truncated at 32K tokens.
- MCP-Atlas: Public set score; gemini-2.5-pro as the judge.
- AIME 26: Full AIME 2026 (I & II); scores may differ from Qwen 3.5 notes.
- MMMU-Pro (Gemma4-31B, Gemma4-26BA4B): the original table marks these scores with an asterisk indicating a special evaluation condition.
- Vision language (Qwen3.6): Qwen reports that Qwen3.6-35B-A3B's vision-language performance matches Claude Sonnet 4.5 with strengths in spatial intelligence (92.0 on RefCOCO, 50.8 on ODInW13).
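Since the tables above mix averaged pass rates ("avg of 5 runs") with consistency metrics like Claw-Eval Pass^3, here is a minimal sketch of the difference, assuming Pass^k means a task counts only when all k independent runs succeed (the usual τ-bench-style definition; Qwen's exact definition is not stated here).

```python
# Hypothetical per-task results: one list of pass/fail outcomes per task (k = 3 runs each).
runs_per_task = [
    [True, True, True],
    [True, False, True],
    [False, False, True],
    [True, True, False],
]

# Averaged pass rate: mean success over every individual run ("avg of N runs" style).
all_runs = [outcome for task in runs_per_task for outcome in task]
avg_pass = sum(all_runs) / len(all_runs)

# Pass^k: a task counts only if all k of its runs succeed, so it is always <= the average.
pass_k = sum(all(task) for task in runs_per_task) / len(runs_per_task)

print(f"avg over runs: {avg_pass:.2f}")  # 0.67
print(f"Pass^3:        {pass_k:.2f}")    # 0.25
```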