Consolidated LLM benchmarks, including Opus 4.7, Qwen 3.5 & more - 2026-04-16

Opus 4.7 and Qwen3.6-35B-A3B Benchmark Comparison

Original Qwen3.6-35B-A3B benchmark table, extended to include scores for Claude Opus 4.7, Claude Opus 4.6, Claude Sonnet 4.6, gpt-5.4-high, and gemini-3.1-pro-preview. Empty cells indicate scores not published by the vendor or not applicable. Cross-vendor numbers are not directly comparable because each vendor uses a different evaluation harness; see Methodology Flags at the bottom.

Coding Agent

Benchmark Opus 4.7 Qwen3.5-27B Qwen3.5-35B-A3B Gemma4-31B Gemma4-26BA4B Qwen3.6-35B-A3B Opus 4.6 Sonnet 4.6 gpt-5.4-high gemini-3.1-pro-preview
SWE-bench Verified 87.6 75.0 70.0 52.0 17.4 73.4 80.8 79.6 77.2* 80.6
SWE-bench Multilingual 69.3 60.3 51.7 17.3 67.2 77.8
SWE-bench Pro 64.3 51.2 44.6 35.7 13.8 49.5 53.4 57.7 54.2
Terminal-Bench 2.0 69.4 41.6 40.5 42.9 34.2 51.5 65.4 59.1 75.1† 68.5
Claw-Eval Avg 64.3 65.4 48.5 58.8 68.7
Claw-Eval Pass^3 46.2 51.0 25.0 28.0 50.0
SkillsBench Avg5 27.2 4.4 23.6 12.3 28.7
QwenClawBench 52.2 47.7 41.7 38.7 52.6
NL2Repo 27.3 20.5 15.5 11.6 29.4
QwenWebBench 1068 978 1197 1178 1397

* gpt-5.4-high SWE-bench Verified: OpenAI did not publish a Verified number in the 5.4 launch (they emphasized SWE-bench Pro). 77.2% is an independent Vals.ai run, not vendor-reported.

† gpt-5.4-high Terminal-Bench 2.0: self-reported with OpenAI's own harness, not Terminus-2. The independent tbench.ai ForgeCode run with Terminus-KIRA shows 74.7%.

General Agent

Benchmark Opus 4.7 Qwen3.5-27B Qwen3.5-35B-A3B Gemma4-31B Gemma4-26BA4B Qwen3.6-35B-A3B Opus 4.6 Sonnet 4.6 gpt-5.4-high gemini-3.1-pro-preview
TAU3-Bench 68.4 68.9 67.5 59.0 67.2
VITA-Bench 41.8 29.1 43.0 36.9 35.6
DeepPlanning 22.6 22.8 24.0 16.2 25.9
Tool Decathlon 31.5 28.7 21.2 12.0 26.9
MCPMark 36.3 27.0 18.1 14.2 37.0
MCP-Atlas 77.3 68.4 62.4 57.2 50.0 62.8 75.8 61.3 68.1 69.2
WideSearch 66.4 59.1 35.2 38.3 60.1

Knowledge

Benchmark Opus 4.7 Qwen3.5-27B Qwen3.5-35B-A3B Gemma4-31B Gemma4-26BA4B Qwen3.6-35B-A3B Opus 4.6 Sonnet 4.6 gpt-5.4-high gemini-3.1-pro-preview
MMLU-Pro 86.1 85.3 85.2 82.6 85.2 79.1
MMLU-Redux 93.2 93.3 93.7 92.7 93.3
SuperGPQA 65.6 63.4 65.7 61.4 64.7
C-Eval 90.5 90.2 82.6 82.5 90.0

STEM & Reasoning

Benchmark Opus 4.7 Qwen3.5-27B Qwen3.5-35B-A3B Gemma4-31B Gemma4-26BA4B Qwen3.6-35B-A3B Opus 4.6 Sonnet 4.6 gpt-5.4-high gemini-3.1-pro-preview
GPQA 94.2 85.5 84.2 84.3 82.3 86.0 91.3 74.1 92.8 94.3
HLE 46.9 24.3 22.4 19.5 8.7 21.4 40.0 39.8 44.4
HLE (with tools) 54.7 53.0 52.1 51.4
LiveCodeBench v6 80.7 74.6 80.0 77.1 80.4
HMMT Feb 25 92.0 89.0 88.7 91.7 90.7
HMMT Nov 25 89.8 89.2 87.5 87.5 89.1
HMMT Feb 26 84.3 78.7 77.2 79.0 83.6
IMOAnswerBench 79.9 76.8 74.5 74.3 78.9
AIME26 92.6 91.0 89.2 88.3 92.7

Vision Language

Benchmark Opus 4.7 Qwen3.5-27B Qwen3.5-35B-A3B Gemma4-31B Gemma4-26BA4B Qwen3.6-35B-A3B Claude-Sonnet-4.5 Opus 4.6 Sonnet 4.6 gpt-5.4-high gemini-3.1-pro-preview
STEM and Puzzle
MMMU 82.3 81.4 80.4 78.4 81.7 79.6
MMMU-Pro 75.0 75.1 76.9 73.8 75.3 68.4 73.9 80.5
Mathvista(mini) 87.8 86.2 79.3 79.4 86.4 79.8
ZEROBench_sub 36.2 34.1 26.0 26.3 34.4 26.3
General VQA
RealWorldQA 83.7 84.1 72.3 72.2 85.3 70.3
MMBenchEN-DEV-v1.1 92.6 91.5 90.9 89.0 92.8 88.3
SimpleVQA 56.0 58.3 52.9 52.2 58.9 57.6
HallusionBench 70.0 67.9 67.4 66.1 69.8 59.9
Text Recognition and Document Understanding
OmniDocBench1.5 88.9 89.3 80.1 74.4 89.9 85.8
CharXiv(RQ) 82.1 79.5 77.5 67.9 69.0 78.0 67.2 68.7
CC-OCR 81.0 80.7 75.7 74.5 81.9 68.1
AI2D_TEST 92.9 92.6 89.0 88.3 92.7 87.0
Spatial Intelligence
RefCOCO(avg) 90.9 89.2 92.0
ODInW13 41.1 42.6 50.8
EmbSpatialBench 84.5 83.1 84.3 71.8
RefSpatialBench 67.7 63.5 64.3
Video Understanding
VideoMME(w sub.) 87.0 86.6 86.6 81.1
VideoMME(w/o sub.) 82.8 82.5 82.5 75.3
VideoMMMU 82.3 80.4 81.6 76.0 83.7 77.6
MLVU 85.9 85.6 86.2 72.8
MVBench 74.6 74.8 74.6
LVBench 73.6 71.4 71.4

Additional frontier-model benchmarks (not in original Qwen table)

These benchmarks are commonly reported across the five frontier models but were not part of Qwen's original comparison:

Benchmark Opus 4.7 Opus 4.6 Sonnet 4.6 gpt-5.4-high gemini-3.1-pro-preview
OSWorld-Verified 78.0 72.7 72.5 75.0
BrowseComp 79.3 84.0 89.3‡ 85.9
τ²-bench Retail 91.9 91.7 90.8
τ²-bench Telecom 97.9 97.9 99.3
MMMLU 91.5 91.1 92.6

‡ BrowseComp 89.3 is for GPT-5.4 Pro, not GPT-5.4-high.

arena.ai Elo rankings (April 16, 2026)

Elo ratings, not percentage scores. Opus 4.7 was released on April 16, 2026 and is not yet ranked.

Arena Opus 4.7 Opus 4.6 Sonnet 4.6 gpt-5.4-high gemini-3.1-pro-preview
Text not yet ranked 1496 1481 1493
Code not yet ranked 1545 1524 1457 1454
Vision not yet ranked 1293 1274 1277
Document not yet ranked 1515 1500 1484 1450

Methodology Flags

  1. SWE-bench Verified harness differs by vendor: Anthropic averages over 25 trials with optional prompt modification; OpenAI uses a fixed n=477 subset on internal infra (and did not report Verified for GPT-5.4); Google uses its own bash+file-ops+submit scaffold with a +0.6% buggy-item adjustment; Qwen uses an internal agent scaffold with bash + file-edit tools. Numbers across these columns are not apples-to-apples.

  2. Terminal-Bench 2.0 harness mismatch: Anthropic and Google both use the Terminus-2 harness (comparable). OpenAI's 75.1% is self-reported with a different harness and is flagged as non-comparable. Independent tbench.ai ForgeCode run for GPT-5.4: 74.7%.

  3. Reasoning-effort settings: gpt-5.4-high = reasoning_effort="high". Opus 4.7 introduced a new "xhigh" level; Claude Code defaults to xhigh. Gemini 3.1 Pro scores are all "Thinking High". Qwen scores use temp=1.0 / top_p=0.95 per their table footnotes. A minimal request sketch showing where these settings live follows this list.

  4. Claude extended thinking: Opus 4.7 moved to adaptive thinking + effort levels. HLE splits "no tools" vs "with tools" (web search + fetch + code exec + programmatic tool calls, max effort).

  5. GPT-5.4 Pro vs GPT-5.4-high: BrowseComp 89.3%, HLE with-tools 58.7%, and GPQA 94.4% were published for GPT-5.4 Pro, a different model tier than gpt-5.4-high. The table uses GPT-5.4-high numbers where available.

  6. Opus 4.6 retroactive revisions: Anthropic revised MCP-Atlas (to 75.8% per Scale AI regrade), CyberGym (66.6% → 73.8%), and HLE with-tools (53.1% → 53.0%) after initial February 5 publication. The April 16 Opus 4.7 announcement chart is the canonical source for 4.6 numbers.

  7. Qwen-proprietary benchmarks: Claw-Eval, SkillsBench, QwenClawBench, NL2Repo, QwenWebBench, VITA-Bench, and WideSearch are Qwen-internal or Qwen-customized benchmarks. No frontier model vendor has published scores on these.

  8. Benchmark saturation at the frontier: GPQA Diamond, MMLU-Pro, and AIME 2025 are saturated; differences like 94.2 vs. 94.3 vs. 94.4 are noise. Real differentiation is in SWE-bench Pro, HLE, MCP-Atlas, BrowseComp, and Terminal-Bench 2.0.
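
As a reference for flag 3, here is a minimal sketch of where the reasoning-effort and thinking settings sit in an API request. The model ids are placeholders copied from the tables (not confirmed API identifiers), and the parameter shape for Opus 4.7's new "xhigh" effort level has not been published, so the Anthropic call below uses the existing extended-thinking request shape with a token budget instead.

```python
# Hedged sketch only: parameter names below exist in the current OpenAI and
# Anthropic Python SDKs; the model ids are placeholders from the tables.
from openai import OpenAI
from anthropic import Anthropic

# "gpt-5.4-high" in the tables means reasoning effort set to "high".
openai_client = OpenAI()
gpt_resp = openai_client.responses.create(
    model="gpt-5.4",                      # placeholder id, not confirmed
    reasoning={"effort": "high"},         # the setting flag 3 refers to
    input="Explain the difference between SWE-bench Verified and SWE-bench Pro.",
)

# Opus 4.7's "xhigh" effort level is not shown here; this uses today's
# extended-thinking parameters with an explicit token budget.
anthropic_client = Anthropic()
claude_resp = anthropic_client.messages.create(
    model="claude-opus-4-7",              # placeholder id, not confirmed
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 8000},
    messages=[{"role": "user", "content": "Same question as above."}],
)
```

Gemini's "Thinking High" setting has its own API knob that is not sketched here; the Qwen sampling settings are sketched after the footnotes section below.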

Footnotes from Qwen's original table

  • SWE-Bench Series: Internal agent scaffold (bash + file-edit tools); temp=1.0, top_p=0.95, 200K context window. Qwen corrected some problematic tasks in the public SWE-bench Pro set and evaluated all baselines on the refined benchmark.
  • Terminal-Bench 2.0: Harbor/Terminus-2 harness; 3h timeout, 32 CPU/48 GB RAM; temp=1.0, top_p=0.95, top_k=20, max_tokens=80K, 256K ctx; avg of 5 runs (these sampling settings are sketched as an example request after this list).
  • SkillsBench: Evaluated via OpenCode on 78 tasks (self-contained subset, excluding API-dependent tasks); avg of 5 runs.
  • NL2Repo: Other models are evaluated via Claude Code (temp=1.0, top_p=0.95, max_turns=900).
  • QwenClawBench: Internal real-user-distribution Claw agent benchmark (open-sourcing soon); temp=0.6, 256K ctx.
  • QwenWebBench: Internal front-end code generation benchmark; bilingual (EN/CN), 7 categories (Web Design, Web Apps, Games, SVG, Data Visualization, Animation, 3D); auto-render + multimodal judge; BT/Elo rating system.
  • TAU3-Bench: Official user model (gpt-5.2, low reasoning effort) + default BM25 retrieval.
  • VITA-Bench: Avg subdomain scores; claude-4-sonnet used as the judge (the official judge, claude-3.7-sonnet, is no longer available).
  • MCPMark: GitHub MCP v0.30.3; Playwright responses truncated at 32K tokens.
  • MCP-Atlas: Public set score; gemini-2.5-pro as the judge.
  • AIME 26: Full AIME 2026 (I & II); scores may differ from Qwen 3.5 notes.
  • MMMU-Pro (Gemma4-31B, Gemma4-26BA4B): the original table marks these scores with an asterisk indicating a special evaluation condition.
  • Vision language (Qwen3.6): Qwen reports that Qwen3.6-35B-A3B's vision-language performance matches Claude Sonnet 4.5, with strengths in spatial intelligence (92.0 on RefCOCO, 50.8 on ODInW13).
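
To make the sampling settings in these footnotes concrete, here is a minimal sketch of an OpenAI-compatible chat request that carries them. The base URL, API key, model id, and the use of extra_body for top_k are all assumptions (for example, a self-hosted vLLM-style server); Qwen's actual evaluation harness is internal and not published.

```python
# Hedged sketch, not Qwen's harness: forwards the sampling settings from the
# Terminal-Bench 2.0 footnote to an assumed OpenAI-compatible endpoint.
from openai import OpenAI

# Assumed local server exposing an OpenAI-compatible API (e.g. vLLM).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="Qwen3.6-35B-A3B",          # placeholder model id
    messages=[{"role": "user", "content": "Fix the failing test in this repo."}],
    temperature=1.0,                  # temp=1.0 per the footnotes
    top_p=0.95,
    max_tokens=80_000,                # Terminal-Bench footnote caps generation at 80K tokens
    extra_body={"top_k": 20},         # top_k is not a standard OpenAI param; many servers accept it via extra_body
)
print(resp.choices[0].message.content)
```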