Model replacement benchmark for IG vacancy classification

Date: 2026-05-20 (last updated 2026-06-10 with grok-build-0.1)
Author: Arsen
Status: Final. Decision locked: gpt-5.4-nano (reasoning_effort=medium) via OpenAI Batch API.

1. TL;DR


Winner	gpt-5.4-nano (reasoning_effort = medium), via OpenAI Batch API
Quality	87.7 % weighted agreement with grok-4-1-fast-reasoning baseline on 100 vacancies
Cost (measured)	$1.54 / 1 000 calls (batch + prompt cache)
4 M 2024 backfill projection	~$6.2 K
Reliability	100/100 on the bench
Backup if cost beats quality	gemini-2.5-flash-lite + thinking, via OpenRouter or Vertex Batch

2. Why this research was needed

On 2026-05-15 12:00 PM PT xAI retired grok-4-1-fast-reasoning, the model we had been using for vacancy classification at scale (~8 M Dutch vacancies). Two problems with continuing on xAI:

xAI's official replacement grok-4.3 is ~7× more expensive per request. At our volumes — ~5.4 M cancelled records + the 4 M 2024 backfill + ongoing flow — staying on grok-4.3 makes the project economically infeasible. We did re-classify most of 2025 + 2026 H1 on grok-4.3 inside the credit xAI issued, but at full price it does not fit.
xAI silently redirected ~24 hours of grok-4-1 traffic to grok-4.3 right before the cutoff and was slow to recalculate the resulting ~$10 K overcharge (resolved separately via credit; outside the scope of this document).

The research question therefore: is there a non-xAI model that matches grok-4-1-fast-reasoning's output quality on IG vacancies, at roughly comparable cost?

3. Method

3.1 Dataset

100 vacancies pulled from vacancy_classified, all originally processed by grok-4-1-fast-reasoning before the May 15 cutoff (so the baseline outputs are clean — no grok-4.3 contamination). Diversity is explicit on:

Language: ~90 % Dutch, ~10 % English
Experience level: junior, medior, senior, experienced, starter, not_specified
With/without salary
With/without side-job
Text length 500–3000 chars (no trivial-short, no boilerplate-dominated)

Selection script: bench/build-baseline.js. Output: bench/baseline.json (100 records, each with job_id, full_text, and the original grok-4-1 output as baseline_output).

3.2 Prompt

All candidates received the exact production system prompt (prompt.js, ~12 K input tokens). Two small additions for non-xAI models, applied uniformly:

Preserve original language in jobtitle_ai (GPT models otherwise translated Dutch titles to English).
Always emit the nested JSON structure with lead, country, is_jobboard present (Gemini routes otherwise flattened or dropped fields).

These patches were validated on a separate v1 vs v2 comparison: they fix structural artefacts without changing the substantive classification.

3.3 Schema enforcement

Provider	Schema mode
OpenAI direct (gpt-5.4-nano, gpt-5-nano)	Native `json_schema` — tight nested output matching baseline shape
Gemini via OpenRouter	`json_object` only (OpenRouter doesn't forward json_schema reliably); evaluator normalises flat-vs-nested
DeepSeek via OpenRouter	Same as Gemini route
xAI direct (grok-4.3, grok-build-0.1)	Native `json_schema` (sanitised — '/' in enum strings replaced with U+2215)
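
The enum sanitisation mentioned for the xAI route can be sketched as follows. This is a minimal illustration, not the bench code: the function names, the sample schema, and the step that maps the division slash back to '/' before evaluation are all assumptions.

```javascript
// Sketch only: names and the sample schema are hypothetical.
const DIVISION_SLASH = "\u2215"; // U+2215, visually close to '/'

// Return a copy of a JSON Schema with '/' replaced inside enum strings.
function sanitiseEnums(node) {
  if (Array.isArray(node)) return node.map(sanitiseEnums);
  if (node === null || typeof node !== "object") return node;
  const out = {};
  for (const [key, value] of Object.entries(node)) {
    if (key === "enum" && Array.isArray(value)) {
      out[key] = value.map((v) =>
        typeof v === "string" ? v.replaceAll("/", DIVISION_SLASH) : v
      );
    } else {
      out[key] = sanitiseEnums(value);
    }
  }
  return out;
}

// Assumed inverse step: restore '/' in model output before comparing to baseline.
function restoreSlashes(value) {
  if (typeof value === "string") return value.replaceAll(DIVISION_SLASH, "/");
  if (Array.isArray(value)) return value.map(restoreSlashes);
  if (value !== null && typeof value === "object") {
    return Object.fromEntries(
      Object.entries(value).map(([k, v]) => [k, restoreSlashes(v)])
    );
  }
  return value;
}

// Hypothetical schema fragment; the real enum values live in the production schema.
const exampleSchema = {
  type: "object",
  properties: {
    education_level: { type: "string", enum: ["HBO/WO", "MBO"] },
  },
};
const sanitisedSchema = sanitiseEnums(exampleSchema);
```

Without a restore step, any sanitised enum value would fail the exact-match comparators in §3.4 against a baseline that still contains '/'.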

3.4 Evaluation metric

bench/evaluate.js normalises outputs into 22 comparable fields. Per-field comparators:

Field type	Comparator
Categorical (language, exp level, country, period, contract_type)	Exact match (case- and whitespace-insensitive)
Free-text identifiers (jobtitle_ai, city, province, advertiser)	Substring + Jaccard word-overlap
Numeric (salary min/max, hours, years)	Within ±10 %
Arrays (pull_factors, job_benefits, education_*)	Jaccard similarity
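
A minimal sketch of the four comparator types in the table above. The function names are hypothetical, and so are two details: how the substring test and the word-overlap score are combined, and how empty values are scored. bench/evaluate.js may handle both differently.

```javascript
// Sketch only: not the actual bench/evaluate.js implementation.
const norm = (s) => String(s ?? "").trim().toLowerCase();

// Categorical fields: exact match, ignoring case and surrounding whitespace.
function exactMatch(a, b) {
  return norm(a) === norm(b) ? 1 : 0;
}

// Array fields: Jaccard similarity = |A ∩ B| / |A ∪ B| on normalised items.
function jaccard(a, b) {
  const setA = new Set(a.map(norm));
  const setB = new Set(b.map(norm));
  if (setA.size === 0 && setB.size === 0) return 1; // assumption: both empty counts as agreement
  let intersection = 0;
  for (const item of setA) if (setB.has(item)) intersection++;
  return intersection / (setA.size + setB.size - intersection);
}

// Numeric fields: full credit when the candidate is within ±10 % of the baseline.
function numericMatch(candidate, baseline, tolerance = 0.10) {
  if (candidate == null || baseline == null) return candidate == baseline ? 1 : 0;
  if (baseline === 0) return candidate === 0 ? 1 : 0;
  return Math.abs(candidate - baseline) <= tolerance * Math.abs(baseline) ? 1 : 0;
}

// Free-text identifiers: substring containment first, else word-level Jaccard.
function freeTextMatch(a, b) {
  const x = norm(a);
  const y = norm(b);
  if (x === "" || y === "") return x === y ? 1 : 0;
  if (x.includes(y) || y.includes(x)) return 1;
  return jaccard(x.split(/\s+/), y.split(/\s+/));
}
```

Under these assumptions, "Senior Developer" vs "Senior Software Developer" gets partial credit (2 shared words out of 3 distinct words) rather than a zero.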

Field weights:

pull_factors (w=3) — biggest weight, most IG-specific signal
jobtitle_ai, exp_level, sal_min, sal_max, education_level (w=2)
Everything else (w=1)

Overall quality = Σ (field_score × weight) / Σ weights.
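
As a worked check of the formula, the sketch below applies the weights above to the rounded per-field scores from the gpt-5.4-nano (medium) column in §4.2. The variable and function names are illustrative. Only the 21 fields listed in that table are included, so the result is expected to agree with the reported headline only to within rounding.

```javascript
// Weights from §3.4; any field not listed here has weight 1.
const WEIGHTS = {
  pull_factors: 3,
  jobtitle_ai: 2, exp_level: 2, sal_min: 2, sal_max: 2, education_level: 2,
};

// Overall quality = sum(score * weight) / sum(weight).
function weightedOverall(scores) {
  let numerator = 0;
  let denominator = 0;
  for (const [field, score] of Object.entries(scores)) {
    const weight = WEIGHTS[field] ?? 1;
    numerator += score * weight;
    denominator += weight;
  }
  return numerator / denominator;
}

// Rounded per-field scores (%), gpt-5.4-nano (medium), copied from §4.2.
const gpt54NanoMedium = {
  lang: 100, jobtitle_ai: 83, exp_level: 74, exp_min: 71, exp_max: 97,
  lead: 100, country: 100, city: 90, province: 94, advertiser_type: 90,
  sal_min: 97, sal_max: 97, sal_period: 88, contract_type: 72,
  hours_min: 95, hours_max: 95, side_job: 63, pull_factors: 79,
  job_benefits: 88, education_level: 92, education_subject: 88,
};

const overallMedium = weightedOverall(gpt54NanoMedium);
console.log(overallMedium.toFixed(1)); // 2454 / 28, i.e. about 87.6
```

The small gap to the reported 87.7 % is consistent with the inputs here being already rounded to whole percentages.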

3.5 Cost methodology

Measured (OpenAI Batch API): real /v1/batches runs on the 100-vacancy baseline. Prices include the 50 % batch discount and the ~10× cached-input discount that kicks in after the first ~50 calls share the 12 K-token system prompt.
Measured (xAI Batch): pulled from xAI's Management API for grok-4.3 and grok-4-1 production runs.
Measured (sync): for models without a batch API (grok-build-0.1 currently), sync pricing from OpenRouter's billing.
Projected: for models tested only synchronously, applied a flat 50 % batch discount and estimated cache amortisation (input cost → near zero at steady state because the 12 K-token system prompt is shared across all calls).

Realistic production cost is dominated by output tokens once cache is warm; the input tokens contribute negligibly at 4 M+ scale.
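
The projection logic described above can be sketched as a small cost model. Every number in the example call is a placeholder chosen for illustration (token counts, prices, cache hit fraction), not a measured or published price, and the assumption that the batch discount stacks multiplicatively on top of the cached-input discount should be verified per provider.

```javascript
// Sketch only: parameter names and all example values are hypothetical.
function costPer1kCalls({
  inputTokens,       // prompt tokens per call (system prompt + vacancy text)
  cachedFraction,    // share of input tokens served from the prompt cache (0..1)
  outputTokens,      // completion tokens per call, including reasoning tokens
  inputPricePerM,    // sync list price, USD per 1 M input tokens
  outputPricePerM,   // sync list price, USD per 1 M output tokens
  cachedDiscount,    // cached input is billed at price / cachedDiscount
  batchDiscount,     // e.g. 0.5 for a 50 % batch discount
}) {
  const cachedTokens = inputTokens * cachedFraction;
  const freshTokens = inputTokens - cachedTokens;
  const inputCost =
    (freshTokens * inputPricePerM +
      (cachedTokens * inputPricePerM) / cachedDiscount) / 1e6;
  const outputCost = (outputTokens * outputPricePerM) / 1e6;
  return 1000 * (inputCost + outputCost) * (1 - batchDiscount);
}

// Placeholder inputs: compare a cold cache with a warm one.
const placeholder = {
  inputTokens: 12000, outputTokens: 1000,
  inputPricePerM: 0.2, outputPricePerM: 1.0,
  cachedDiscount: 10, batchDiscount: 0.5,
};
const coldCost = costPer1kCalls({ ...placeholder, cachedFraction: 0 });
const warmCost = costPer1kCalls({ ...placeholder, cachedFraction: 0.95 });
```

With a model like this, the sensitivity of a projection to the cache hit rate and to output length can be checked before committing to a run.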

3.6 What this benchmark does NOT measure

Statistical significance: 100 vacancies is direction-finding, not a confident point estimate per field.
Edge cases at scale: no multi-language, malformed, or extremely long postings.
End-to-end Minerva flow: only the LLM step. Downstream ISCO/Jobfeed enrichment is unchanged and untested against candidates.

4. Models evaluated

11 models in total, evaluated in three stages: an initial scout sweep (v1), a batch-measured rerun of the top two (v2), and two follow-up requests (gpt-5.4-nano high for Stefan, grok-build-0.1 for the xAI question). Throughout this document, the quality column for every model uses the most credible measurement available for it (batch-measured where it exists; otherwise the v2-relaxed score from the manual-review-corrected metric; otherwise v1 auto-score).

4.1 Master leaderboard (all models, all key metrics)

Ranked by primary decision criterion: measured / projected batch+cache cost-effectiveness combined with quality.

#	Model	Route	Quality	Reliability	Latency	Cost batch+cache / 1K	4 M 2024 projection	Status
1	gpt-5.4-nano (medium)	OpenAI direct	87.7 %	100/100	13 s sync	$1.54 (MEASURED)	~$6.2 K	WINNER
2	gpt-5.4-nano (high)	OpenAI direct	87.4 %	100/100	longer than medium	$1.84 (MEASURED)	~$7.4 K	Dropped: same quality, +19 % cost
3	gemini-2.5-flash-lite + thinking	OpenRouter	72.8 %	100/100 after retry patch	42 s sync	~$1.45 (projected)	~$5.8 K	Backup if cost beats quality
4	grok-build-0.1	OpenRouter	77.5 %	100/100	53 s sync	$8–10 (projected); $26.50 sync MEASURED	$32–40 K	Dropped: 5× cost, −10 pp quality
5	gpt-5-nano (medium)	OpenAI direct	76.9 %	100/100	62 s sync	~$1.40 (projected)	~$5.6 K	Dropped: dominated by gpt-5.4-nano, ~5× slower
6	gemini-3.1-flash-lite	OpenRouter	65.7 %	19/20	2 s sync (fastest)	~$1.10 (projected)	~$4.4 K	Dropped: drops too many salaries (53 % match)
7	deepseek-v4-flash	OpenRouter	58.6 %	20/20	37 s sync	~$0.80 (projected)	~$3.2 K	Dropped: drops salaries, omits city/lang
—	qwen3-max-thinking	OpenRouter	n/a	3/100	timeouts at 180 s	—	—	Dropped: too slow / unreliable
—	qwen3.5-35b-a3b (MoE)	OpenRouter	n/a	40/100	timeouts at 180 s	—	—	Dropped: too slow / unreliable
—	grok-4.3	xAI direct	~grok-4-1 parity	100 %	—	$6.40 (MEASURED batch)	~$25.6 K	Workable only inside xAI's credit
—	grok-4-1-fast-reasoning (baseline)	xAI direct	(baseline)	—	—	$0.89 (MEASURED batch)	n/a — retired	The bar

4.2 Per-field comparison vs grok-4-1-fast-reasoning baseline

All scores are 0–100 %. Empty cell = field not evaluated for that model (e.g. qwen variants never completed enough calls).

Field	weight	gpt-5.4-nano (medium)	gpt-5.4-nano (high)	gpt-5-nano (medium)	gemini-2.5-fl-lite + thinking	gemini-3.1-flash-lite	deepseek-v4-flash	grok-build-0.1
lang	1	100 %	100 %	100 %	100 %	100 %	90 %	87 %
jobtitle_ai	2	83 %	85 %	52 %	80 %	78 %	71 %	85 %
exp_level	2	74 %	73 %	60 %	75 %	84 %	55 %	81 %
exp_min	1	71 %	71 %	70 %	75 %	74 %	40 %	81 %
exp_max	1	97 %	97 %	95 %	100 %	89 %	100 %	98 %
lead	1	100 %	99 %	95 %	38 %	0 %	5 %	80 %
country	1	100 %	100 %	100 %	13 %	0 %	25 %	77 %
city	1	90 %	92 %	90 %	81 %	79 %	45 %	90 %
province	1	94 %	93 %	85 %	81 %	84 %	50 %	95 %
advertiser_type	1	90 %	91 %	70 %	88 %	95 %	80 %	91 %
sal_min	2	97 %	94 %	100 %	94 %	53 %	60 %	96 %
sal_max	2	97 %	96 %	100 %	94 %	53 %	60 %	97 %
sal_period	1	88 %	90 %	100 %	100 %	89 %	55 %	98 %
contract_type	1	72 %	73 %	75 %	19 %	21 %	20 %	33 %
hours_min	1	95 %	93 %	90 %	88 %	95 %	60 %	74 %
hours_max	1	95 %	93 %	95 %	88 %	95 %	85 %	84 %
side_job	1	63 %	65 %	45 %	75 %	95 %	95 %	64 %
pull_factors	3	79 %	79 %	54 %	59 %	69 %	67 %	72 %
job_benefits	1	88 %	86 %	60 %	67 %	36 %	51 %	63 %
education_level	2	92 %	93 %	88 %	91 %	74 %	78 %	45 %
education_subject	1	88 %	85 %	83 %	75 %	63 %	50 %	29 %
Weighted overall		87.7 %	87.4 %	76.9 %	72.8 %	65.7 %	58.6 %	77.5 %

Reading the table:

gpt-5.4-nano medium and high are effectively tied on every field (deltas within ±3 pp, no field where high is materially better). High costs +19 % for no measurable quality gain on an extraction task; see §5.2.
gpt-5.4-nano is strong on numeric/categorical fields (salary, contract_type, education_*) and on the structural fields (lead, country) that OpenRouter-routed Gemini/DeepSeek largely drop.
gemini-2.5-flash-lite + thinking is the only non-OpenAI candidate that competes. It is weak on contract_type (19 %) and on the structural fields (lead 38 %, country 13 %) because OpenRouter strips strict schema enforcement, and it trails gpt-5.4-nano on education_subject (75 % vs 88 %) and on the highest-weighted field, pull_factors (59 % vs 79 %). It is close on jobtitle_ai (80 % vs 83 %), which together with its projected price makes it the second-tier candidate.
gemini-3.1-flash-lite and deepseek-v4-flash both fail salary capture (sal_min/sal_max ~53–60 %), disqualifying for IG where salary extraction is non-negotiable.
grok-build-0.1 looks superficially fine on quality (77.5 %, best on exp_min and strong on sal_min and sal_period) but craters on contract_type (33 %) and education_* (29–45 %), the same omission pattern as the Gemini routes, and is ~5× more expensive than gpt-5.4-nano even with optimistic batch+cache discounts.

5. Per-model notes

Each entry below explains the trade-off in one short paragraph, in leaderboard order.

5.1 gpt-5.4-nano (reasoning_effort = medium) — WINNER

Quality 87.7 %. Cost MEASURED via OpenAI /v1/batches = $1.54 / 1 K calls. 4 M 2024 projection = ~$6.2 K.

At or near the top of the leaderboard on the numeric and categorical fields that downstream consumers (Minerva, ISCO, Jobfeed) actually read: salary (97 %), contract_type (72 %), education_level (92 %), education_subject (88 %), job_benefits (88 %). Native json_schema support on OpenAI's direct API gives reliable structural output at scale, which none of the OpenRouter-routed candidates can claim. Latency of 13 s sync is irrelevant on the batch path and acceptable if a real-time flow ever appears.

5.2 gpt-5.4-nano (reasoning_effort = high)

Quality 87.4 %. Cost MEASURED = $1.84 / 1 K calls (+19 % over medium).

Stefan asked specifically whether high reasoning would change the outcome. Tested on the same 100 vacancies through the same OpenAI Batch endpoint. Per-field deltas vs medium scatter ±2–3 pp in both directions with no field where high is materially better; overall −0.3 pp is well inside the noise floor for 100 vacancies. The extra reasoning budget shows up as +90 % output tokens (more chain-of-thought) but does not translate into measurable extraction quality, which is consistent with this being a structured-extraction task rather than a multi-step-reasoning task. Medium wins on cost-effectiveness with no quality trade-off.
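
The per-field comparison can be reproduced from the rounded scores in §4.2. The sketch below is illustrative (variable names are not from the bench code) and simply computes high minus medium per field, plus the cost premium from the two measured prices.

```javascript
// Field order and rounded scores (%) copied from the §4.2 table.
const FIELDS = [
  "lang", "jobtitle_ai", "exp_level", "exp_min", "exp_max", "lead", "country",
  "city", "province", "advertiser_type", "sal_min", "sal_max", "sal_period",
  "contract_type", "hours_min", "hours_max", "side_job", "pull_factors",
  "job_benefits", "education_level", "education_subject",
];
const mediumScores = [100, 83, 74, 71, 97, 100, 100, 90, 94, 90, 97, 97, 88, 72, 95, 95, 63, 79, 88, 92, 88];
const highScores   = [100, 85, 73, 71, 97,  99, 100, 92, 93, 91, 94, 96, 90, 73, 93, 93, 65, 79, 86, 93, 85];

// Delta in percentage points: positive means high scored better than medium.
const deltas = FIELDS.map((field, i) => ({
  field,
  delta: highScores[i] - mediumScores[i],
}));
const maxAbsDelta = Math.max(...deltas.map((d) => Math.abs(d.delta)));

// Cost premium of high over medium, from the measured $ / 1 K calls.
const costPremiumPct = Math.round((1.84 / 1.54 - 1) * 100);

console.log(maxAbsDelta);    // 3
console.log(costPremiumPct); // 19
```

The largest swing in either direction is 3 pp, which sits below the ±5 pp per-field band quoted in §10.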

5.3 gemini-2.5-flash-lite + thinking — backup candidate

Quality 72.8 %. Cost projected = ~$1.45 / 1 K calls.

Close to gpt-5.4-nano on jobtitle_ai (80 % vs 83 %) but behind on pull_factors (59 % vs 79 %), two IG-distinctive fields, and ~6 % cheaper than gpt-5.4-nano in the batch+cache projection. Weak on contract_type (19 %), education_subject (75 %), and structural fields (lead 38 %, country 13 %) because OpenRouter→Gemini strips strict schema enforcement. Worth keeping warm as a fallback rather than primary: if Stefan finds gpt-5.4-nano's pull_factors output unsatisfactory on spot-checks, this is the cheapest credible alternative.

5.4 grok-build-0.1

Quality 77.5 %. Cost sync MEASURED = $26.50 / 1 K, batch+cache projected = $8–10 / 1 K.

xAI's newest model (released after the original benchmark). "Stay on xAI" would have been operationally simpler (one fewer provider relationship), but quality lands 10 pp below gpt-5.4-nano with the same auxiliary-field-omission pattern as the Gemini routes (contract_type 33 %, education_subject 29 %, education_level 45 %). Cost is the bigger problem: even with an aspirational batch+cache discount it is ~5–6× gpt-5.4-nano (medium) and ~9–11× the grok-4-1 baseline we used to pay; at the measured sync price it is ~30× that baseline. Not competitive on either axis.

5.5 gpt-5-nano (reasoning_effort = medium)

Quality 76.9 %. Cost projected = ~$1.40 / 1 K calls.

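Effectively dominated by gpt-5.4-nano: lower quality at similar projected cost, and dramatically slower (~62 s/call sync, ~5× slower than gpt-5.4-nano). It matches or beats gpt-5.4-nano on the salary fields (sal_min/sal_max 100 %, sal_period 100 %) and on lang, country, city, contract_type, and hours_max, but regresses sharply on the fields that carry the most weight or IG signal: jobtitle_ai (52 % vs 83 %), pull_factors (54 % vs 79 %), exp_level (60 % vs 74 %), and job_benefits (60 % vs 88 %). Dropped before v2 manual review.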

5.6 gemini-3.1-flash-lite

Quality 65.7 %. Cost projected = ~$1.10 / 1 K calls.

Fastest of the bunch (~2 s/call sync), but it loses too many salaries (sal_min/sal_max 53 %), which is disqualifying for an IG classifier where salary extraction is mandatory. Dropped.

5.7 deepseek-v4-flash

Quality 58.6 %. Cost projected = ~$0.80 / 1 K calls (cheapest of all tested).

Cheapest list price. Same salary-capture problem as gemini-3.1; it also sometimes omits the lang and city fields. Cost is attractive but quality is not acceptable for production. Dropped.

5.8 qwen3-max-thinking, qwen3.5-35b-a3b (MoE)

Quality not assessable. qwen3-max-thinking: 3/100 calls completed (81 timeouts at 180 s, rest empty content). qwen3.5-35b-a3b: 40/100 completed (58 timeouts). Both dropped before quality evaluation.

5.9 grok-4.3

Quality ≈ grok-4-1 parity (same model family). Cost MEASURED in production batch = $6.40 / 1 K calls.

Used in production for the 2025 + 2026 H1 backfill against the credit xAI granted. At full price (no credit) it projects to ~$25.6 K for the 4 M 2024 backfill — does not fit the budget. Workable only inside the credit envelope.

5.10 grok-4-1-fast-reasoning (retired baseline)

Cost MEASURED on production traffic = $0.89 / 1 K calls. This is the bar replacements are judged against. Retired by xAI on 2026-05-15; not available for new work.

6. Cost analysis details

6.1 Where the discounts come from

Discount	Applied to	Typical size
Batch API (async)	All providers in the comparison	−50 % off sync listing
Prompt cache (OpenAI auto-cache after first ~50 calls)	OpenAI direct	input tokens billed at ~10× lower for the cached portion
Prompt cache (Gemini explicit cachedContentTokenCount)	Gemini direct/Vertex	input tokens billed at ~4× lower
Prompt cache (DeepSeek)	DeepSeek via OpenRouter	reflected in `prompt_tokens_details.cached_tokens`

Our prompt is ~12 K input tokens shared across every call. At 4 M+ scale, output tokens dominate cost after cache warm-up — the input-token line approaches zero.

6.2 Projected cost — 4 M 2024 backfill

Model	Cost basis	Cost / 1K	4 M 2024 projection
gpt-5.4-nano (medium)	MEASURED batch+cache	$1.54	~$6.2 K
gpt-5.4-nano (high)	MEASURED batch+cache	$1.84	~$7.4 K
gemini-2.5-flash-lite + thinking	Projected batch+cache	~$1.45	~$5.8 K
gpt-5-nano (medium)	Projected batch+cache	~$1.40	~$5.6 K
gemini-3.1-flash-lite	Projected batch+cache	~$1.10	~$4.4 K
deepseek-v4-flash	Projected batch+cache	~$0.80	~$3.2 K
grok-build-0.1	Projected batch+cache	$8–10	$32–40 K
grok-4.3	MEASURED batch	$6.40	~$25.6 K
grok-4-1-fast-reasoning (reference)	MEASURED batch	$0.89	(retired)

Variance band on projections: roughly ±20 %. Measured numbers (gpt-5.4-nano medium, gpt-5.4-nano high, grok-4.3, grok-4-1) are reliable; projected numbers should be read as "expect within ±20 %, measure before committing to a multi-thousand-dollar run."
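
The projection column is plain arithmetic on the cost per 1 K calls. The sketch below (illustrative names) reproduces it and applies the ±20 % band to a projected figure, using only numbers from the table above.

```javascript
const BACKFILL_CALLS = 4_000_000; // the 4 M 2024 backfill

// Project total USD from a cost per 1 K calls, with an optional relative band.
function projectBackfill(costPer1k, band = 0) {
  const point = (costPer1k / 1000) * BACKFILL_CALLS;
  return { low: point * (1 - band), point, high: point * (1 + band) };
}

// Measured: gpt-5.4-nano (medium) at $1.54 / 1 K, no band applied.
const winnerProjection = projectBackfill(1.54);

// Projected: gemini-2.5-flash-lite + thinking at ~$1.45 / 1 K, ±20 %.
const backupProjection = projectBackfill(1.45, 0.2);

console.log(Math.round(winnerProjection.point)); // 6160, i.e. ~$6.2 K
console.log(Math.round(backupProjection.low), Math.round(backupProjection.high)); // 4640 6960
```

Note that the measured gpt-5.4-nano figure falls inside the backup candidate's ±20 % band, so the ~6 % price advantage of the backup is within projection uncertainty until it is measured on a real batch run.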

7. Recommendation

Primary: gpt-5.4-nano (reasoning_effort = medium) via OpenAI Batch API.

Highest weighted quality (87.7 %), about 10 pp above the next distinct model (grok-build-0.1 at 77.5 %) and 14.9 pp above the backup candidate.
100 % reliability on the benchmark.
Cost is MEASURED — $1.54 / 1 K calls = ~$6.2 K for 4 M 2024 — well inside the original ~$10 K project budget.
Native json_schema support on OpenAI's direct API gives structural reliability at scale.
The high tier provides no measurable quality gain at +19 % cost (§5.2). Medium is the right setting.

Backup if cost becomes the deciding factor: gemini-2.5-flash-lite + thinking (~6 % cheaper at 14.9 pp lower quality; close on jobtitle_ai but behind on pull_factors).

Eliminated, one line each:

gpt-5.4-nano (high): identical quality to medium at +19 % cost
gpt-5-nano (medium): dominated by gpt-5.4-nano on quality and latency at similar projected cost
gemini-3.1-flash-lite, deepseek-v4-flash: both lose too many salaries
qwen3-max-thinking, qwen3.5-35b-a3b: timeouts, quality not assessable
grok-build-0.1: −10 pp quality, ~5× cost
grok-4.3: workable only inside the xAI credit; doesn't fit budget at full price

8. Next steps

Stefan, Sabine — spot-check gpt-5.4-nano on a handful of vacancies you would normally review by eye. Specifically: are gpt-5.4-nano's pull_factors outputs acceptable to you, even when they don't word-for-word match grok-4-1?
On green light, port production classify.js to OpenAI's /v1/batches with gpt-5.4-nano + reasoning_effort=medium and run the 4 M 2024 backfill. Expected cost ~$6.2 K, wall-clock 1–3 days through the batch path.
Fold the prompt patches (preserve Dutch jobtitle, keep nested schema) into production prompt.js so future runs benefit. No downstream consumer relies on the English-translated jobtitle form.
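
A minimal sketch of the batch input file for the port described above, assuming the Chat Completions endpoint is used inside the batch. The helper name `toBatchLine`, the schema name, and the placeholder prompt and schema are hypothetical; the real system prompt lives in prompt.js and the real schema in the production code. Each line of the JSONL file is one request with its own `custom_id`.

```javascript
// Sketch only: builds JSONL lines locally, makes no network calls.
function toBatchLine(record, systemPrompt, schema) {
  return JSON.stringify({
    custom_id: String(record.job_id), // used to join results back to vacancies
    method: "POST",
    url: "/v1/chat/completions",      // assumption: Chat Completions, not Responses
    body: {
      model: "gpt-5.4-nano",
      reasoning_effort: "medium",
      messages: [
        // Keep the system prompt byte-identical and first in every request,
        // since prompt caching works on a shared prefix.
        { role: "system", content: systemPrompt },
        { role: "user", content: record.full_text },
      ],
      response_format: {
        type: "json_schema",
        json_schema: { name: "vacancy_classification", strict: true, schema },
      },
    },
  });
}

// Placeholder inputs for illustration only.
const demoSchema = {
  type: "object",
  properties: { lang: { type: "string" } },
  required: ["lang"],
  additionalProperties: false,
};
const demoRecords = [
  { job_id: 101, full_text: "Wij zoeken een senior developer ..." },
  { job_id: 102, full_text: "We are hiring a junior analyst ..." },
];
const batchJsonl = demoRecords
  .map((r) => toBatchLine(r, "SYSTEM PROMPT PLACEHOLDER", demoSchema))
  .join("\n");
```

The resulting file would then be uploaded and submitted through /v1/batches; result lines carry the same `custom_id`, so outputs can be matched to `job_id` regardless of completion order.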

9. Artefacts

bench/baseline.json — the 100-vacancy ground-truth dataset
bench/build-baseline.js — dataset selection script
bench/run.js, bench/providers.js — synchronous benchmark runner
bench/openai-batch.js — measured-cost runner via OpenAI Batch API (used for gpt-5.4-nano medium + high)
bench/evaluate.js — 22-field weighted evaluator
bench/results/runs-openai-batch-2026-05-21T10-09-00.jsonl — gpt-5.4-nano (medium) batch run
bench/results/runs-openai-batch-2026-05-21T14-51-57.jsonl — gpt-5.4-nano (high) batch run
bench/results/runs-2026-06-10T07-05-22.jsonl — grok-build-0.1 run
bench/results/spotcheck-*.md — side-by-side baseline vs candidate output for every vacancy, per run

10. Methodology notes and known limitations

100 vacancies is direction-finding, not statistical proof. Per-field point estimates have ±5 pp confidence bands at this sample size. The headline ranking (gpt-5.4-nano > gemini-2.5 > rest) is robust; specific per-field deltas under 5 pp should not be read as significant.
The auto-quality metric was relaxed once between v1 and v2 after manual review showed v1 over-penalised stylistic differences (e.g. "Senior Developer" vs "Senior Software Developer") and structural omissions (is_jobboard being absent vs being false). Numbers in this document use the v2 metric where the model was re-evaluated; v1-only models (deepseek, gemini-3.1, gpt-5-nano) keep their original v1 scores, which means they may be marginally under-rated. The qualitative ranking does not change either way.
OpenRouter-routed candidates are at a structural disadvantage. Gemini and DeepSeek via OpenRouter cannot use json_schema strict mode, so they drop fields more readily. A production deployment of any of them would invest in few-shot prompting and strict-schema setup to close the gap; this bench did not. The OpenAI direct candidates do not have this issue.
Cost numbers without "MEASURED" tag are projections. They assume 50 % batch discount and steady-state cache warm-up. Real numbers may vary ±20 %. The four measured cost points (gpt-5.4-nano medium, gpt-5.4-nano high, grok-4.3, grok-4-1) are the trustworthy anchors.
End-to-end Minerva flow not retested. The bench only measures the LLM step. ISCO/Jobfeed downstream enrichment is unchanged across candidates so the comparison is valid, but the chosen winner should be re-validated end-to-end on a small production sample before the full 4 M 2024 run.

