Date: 2026-05-20 (last updated 2026-06-10 with grok-build-0.1)
Author: Arsen
Status: Final. Decision locked: gpt-5.4-nano (reasoning_effort=medium) via OpenAI Batch API.
| | |
|---|---|
| Winner | gpt-5.4-nano (reasoning_effort = medium), via OpenAI Batch API |
| Quality | 87.7 % weighted agreement with grok-4-1-fast-reasoning baseline on 100 vacancies |
| Cost (measured) | $1.54 / 1 000 calls (batch + prompt cache) |
| 4 M 2024 backfill projection | ~$6.2 K |
| Reliability | 100/100 on the bench |
| Backup if cost beats quality | gemini-2.5-flash-lite + thinking, via OpenRouter or Vertex Batch |
On 2026-05-15 12:00 PM PT xAI retired grok-4-1-fast-reasoning, the model we had been using for vacancy classification at scale (~8 M Dutch vacancies). Two problems with continuing on xAI:
- xAI's official replacement `grok-4.3` is ~7× more expensive per request. At our volumes (~5.4 M cancelled records + the 4 M 2024 backfill + ongoing flow), staying on grok-4.3 makes the project economically infeasible. We did re-classify most of 2025 + 2026 H1 on grok-4.3 inside the credit xAI issued, but at full price it does not fit.
- xAI silently redirected ~24 hours of grok-4-1 traffic to grok-4.3 right before the cutoff and was slow to recalculate the resulting ~$10 K overcharge (resolved separately via credit; outside the scope of this document).
The research question therefore: is there a non-xAI model that matches grok-4-1-fast-reasoning's output quality on IG vacancies, at roughly comparable cost?
100 vacancies pulled from vacancy_classified, all originally processed by grok-4-1-fast-reasoning before the May 15 cutoff (so the baseline outputs are clean — no grok-4.3 contamination). Diversity is explicit on:
- Language: ~90 % Dutch, ~10 % English
- Experience level: junior, medior, senior, experienced, starter, not_specified
- With/without salary
- With/without side-job
- Text length 500–3000 chars (no trivial-short, no boilerplate-dominated)
Selection script: bench/build-baseline.js. Output: bench/baseline.json (100 records, each with job_id, full_text, and the original grok-4-1 output as baseline_output).
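The record shape described above can be checked with a few lines before a bench run. This is a hypothetical sanity check, not part of `bench/build-baseline.js`: the function name `validateBaseline` is invented here, and it assumes the file holds an array of records whose `baseline_output` is already a parsed object rather than a JSON string.

```javascript
// Hypothetical sanity check for records shaped like those in bench/baseline.json.
// Assumption: baseline_output is a parsed object; the 500-3000 char bounds come from the selection criteria.
function validateBaseline(records, { minChars = 500, maxChars = 3000 } = {}) {
  const problems = [];
  const seenIds = new Set();
  records.forEach((rec, i) => {
    // Every record needs a unique job_id.
    if (rec.job_id === undefined || rec.job_id === null) {
      problems.push(`record ${i}: missing job_id`);
    } else if (seenIds.has(rec.job_id)) {
      problems.push(`record ${i}: duplicate job_id ${rec.job_id}`);
    } else {
      seenIds.add(rec.job_id);
    }
    // full_text must be a string inside the selected length window.
    if (typeof rec.full_text !== 'string') {
      problems.push(`record ${i}: full_text is not a string`);
    } else if (rec.full_text.length < minChars || rec.full_text.length > maxChars) {
      problems.push(`record ${i}: full_text length ${rec.full_text.length} outside ${minChars}-${maxChars}`);
    }
    // The original grok-4-1 output must be present to serve as the comparison target.
    if (rec.baseline_output === null || typeof rec.baseline_output !== 'object') {
      problems.push(`record ${i}: baseline_output is not an object`);
    }
  });
  return problems;
}
```

An empty result means every record has the three expected keys and a text length inside the selection window; it says nothing about the diversity criteria, which would need the classified fields themselves.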
All candidates received the exact production system prompt (prompt.js, ~12 K input tokens). Two small additions for non-xAI models, applied uniformly:
- Preserve original language in `jobtitle_ai` (GPT models otherwise translated Dutch titles to English).
- Always emit the nested JSON structure with `lead`, `country`, `is_jobboard` present (Gemini routes otherwise flattened or dropped fields).
These patches were validated on a separate v1 vs v2 comparison: they fix structural artefacts without changing the substantive classification.
| Provider | Schema mode |
|---|---|
| OpenAI direct (gpt-5.4-nano, gpt-5-nano) | Native json_schema — tight nested output matching baseline shape |
| Gemini via OpenRouter | json_object only (OpenRouter doesn't forward json_schema reliably); evaluator normalises flat-vs-nested |
| DeepSeek via OpenRouter | Same as Gemini route |
| xAI direct (grok-4.3, grok-build-0.1) | Native json_schema (sanitised — '/' in enum strings replaced with U+2215) |
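The enum sanitisation noted for the xAI route can be sketched as follows. This is an illustration rather than the bench code: the helper names `sanitizeEnums` and `restoreSlashes` and the sample schema are hypothetical, and only the substitution of '/' with U+2215 (DIVISION SLASH) is taken from the table above.

```javascript
// Hypothetical helpers; only the '/' -> U+2215 substitution comes from the document.
const DIVISION_SLASH = '\u2215';

// Return a copy of a JSON schema with '/' replaced inside enum string values.
function sanitizeEnums(node) {
  if (Array.isArray(node)) return node.map(sanitizeEnums);
  if (node === null || typeof node !== 'object') return node;
  const out = {};
  for (const [key, value] of Object.entries(node)) {
    if (key === 'enum' && Array.isArray(value)) {
      out[key] = value.map((v) =>
        typeof v === 'string' ? v.replaceAll('/', DIVISION_SLASH) : v
      );
    } else {
      out[key] = sanitizeEnums(value);
    }
  }
  return out;
}

// Map model output back to plain '/' so it compares cleanly against the baseline.
function restoreSlashes(value) {
  if (typeof value === 'string') return value.replaceAll(DIVISION_SLASH, '/');
  if (Array.isArray(value)) return value.map(restoreSlashes);
  if (value !== null && typeof value === 'object') {
    return Object.fromEntries(
      Object.entries(value).map(([k, v]) => [k, restoreSlashes(v)])
    );
  }
  return value;
}

// Illustrative schema fragment; the enum values are made up.
const schema = {
  type: 'object',
  properties: {
    contract_type: { type: 'string', enum: ['full-time', 'temp/contract'] },
  },
};
const sanitized = sanitizeEnums(schema);
```

The reverse mapping matters for scoring: without it, an otherwise correct categorical value would fail the exact-match comparator purely because of the substituted character.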
bench/evaluate.js normalises outputs into 22 comparable fields. Per-field comparators:
| Field type | Comparator |
|---|---|
| Categorical (language, exp level, country, period, contract_type) | Exact match (case- and whitespace-insensitive) |
| Free-text identifiers (jobtitle_ai, city, province, advertiser) | Substring + Jaccard word-overlap |
| Numeric (salary min/max, hours, years) | Within ±10 % |
| Arrays (pull_factors, job_benefits, education_*) | Jaccard similarity |
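The four comparator families in the table can be sketched as below. This is a sketch under assumptions, not the contents of `bench/evaluate.js`: the function names are hypothetical, the way substring matching and Jaccard overlap are combined for free text is a guess (full credit on containment, otherwise word-level Jaccard), and treating two empty values as agreement is also an assumption.

```javascript
// Sketch of the four comparator families; each returns a score in [0, 1].
// Normalise to lower case with collapsed whitespace.
const norm = (s) => String(s ?? '').trim().toLowerCase().replace(/\s+/g, ' ');

// Categorical: exact match after case and whitespace normalisation.
function categoricalScore(a, b) {
  return norm(a) === norm(b) ? 1 : 0;
}

// Jaccard similarity of two collections, compared as sets.
function jaccard(itemsA, itemsB) {
  const setA = new Set(itemsA);
  const setB = new Set(itemsB);
  if (setA.size === 0 && setB.size === 0) return 1; // assumption: both empty counts as agreement
  let shared = 0;
  for (const item of setA) if (setB.has(item)) shared += 1;
  return shared / (setA.size + setB.size - shared);
}

// Free text: full credit if one string contains the other, else word-level Jaccard.
// Assumption: the real evaluator may combine the two signals differently.
function freeTextScore(a, b) {
  const x = norm(a);
  const y = norm(b);
  if (x === '' || y === '') return x === y ? 1 : 0;
  if (x.includes(y) || y.includes(x)) return 1;
  return jaccard(x.split(' '), y.split(' '));
}

// Numeric: agreement when the candidate is within the tolerance (10 %) of the baseline.
function numericScore(candidate, baseline, tolerance = 0.1) {
  if (candidate == null || baseline == null) return candidate == baseline ? 1 : 0;
  if (baseline === 0) return candidate === 0 ? 1 : 0;
  return Math.abs(candidate - baseline) / Math.abs(baseline) <= tolerance ? 1 : 0;
}

// Arrays of labels: Jaccard over normalised items.
function arrayScore(a = [], b = []) {
  return jaccard(a.map(norm), b.map(norm));
}
```

Under these assumptions, "Senior Developer" against "Senior Software Developer" is not a substring match, so it falls through to word overlap (two shared words out of three distinct ones).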
Field weights:
- `pull_factors` (w=3): biggest weight, most IG-specific signal
- `jobtitle_ai`, `exp_level`, `sal_min`, `sal_max`, `education_level` (w=2)
- Everything else (w=1)
Overall quality = Σ (field_score × weight) / Σ weights.
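As a worked instance of the formula, the sketch below recomputes the overall score for gpt-5.4-nano (medium) from the rounded per-field percentages in the per-field table further down. The function name `weightedOverall` is hypothetical; the weights are the ones listed above. Because the inputs are rounded and the table shows 21 fields rather than the evaluator's 22, the result is expected to land close to, not exactly on, the reported figure.

```javascript
// Per-field scores for gpt-5.4-nano (medium), copied from the per-field table (in %).
const scores = {
  lang: 100, jobtitle_ai: 83, exp_level: 74, exp_min: 71, exp_max: 97,
  lead: 100, country: 100, city: 90, province: 94, advertiser_type: 90,
  sal_min: 97, sal_max: 97, sal_period: 88, contract_type: 72,
  hours_min: 95, hours_max: 95, side_job: 63, pull_factors: 79,
  job_benefits: 88, education_level: 92, education_subject: 88,
};

// Weights as listed above; any field not named here defaults to 1.
const weights = {
  pull_factors: 3,
  jobtitle_ai: 2, exp_level: 2, sal_min: 2, sal_max: 2, education_level: 2,
};

// Overall quality = sum(field_score * weight) / sum(weights).
function weightedOverall(fieldScores, fieldWeights) {
  let weightedSum = 0;
  let weightTotal = 0;
  for (const [field, score] of Object.entries(fieldScores)) {
    const w = fieldWeights[field] ?? 1;
    weightedSum += score * w;
    weightTotal += w;
  }
  return weightedSum / weightTotal;
}

const overall = weightedOverall(scores, weights);
console.log(overall.toFixed(1)); // prints "87.6"
```

The weighted sum is 2454 over a total weight of 28, about 87.6 %, within 0.1 pp of the reported 87.7 %.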
- Measured (OpenAI Batch API): real `/v1/batches` runs on the 100-vacancy baseline. Prices include the 50 % batch discount and the ~10× cached-input discount that kicks in after the first ~50 calls share the 12 K-token system prompt.
- Measured (xAI Batch): pulled from xAI's Management API for grok-4.3 and grok-4-1 production runs.
- Measured (sync): for models without a batch API (grok-build-0.1 currently), sync pricing from OpenRouter's billing.
- Projected: for models tested only synchronously, applied a flat 50 % batch discount and estimated cache amortisation (input cost → near zero at steady state because the 12 K-token system prompt is shared across all calls).
Realistic production cost is dominated by output tokens once cache is warm; the input tokens contribute negligibly at 4 M+ scale.
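The claim that output tokens dominate once the cache is warm can be illustrated with a small cost model. Only three inputs come from this document: the ~12 K-token shared prompt, the 50 % batch discount, and the ~10× cached-input discount. The per-token prices, the vacancy token count, and the output token count are placeholders chosen for illustration, not measured or published values, and `costPerCall` is a hypothetical helper.

```javascript
// From this document: shared prompt size, batch discount, cached-input discount.
const PROMPT_TOKENS = 12_000;
const BATCH_MULTIPLIER = 0.5; // 50 % batch discount
const CACHE_DIVISOR = 10;     // cached input billed ~10x lower

// Cost of one call in dollars, split into input and output parts.
// cachedShare is the fraction of the shared prompt served from cache (0 = cold, 1 = warm).
function costPerCall({ cachedShare, vacancyTokens, outputTokens, inputPricePerM, outputPricePerM }) {
  const cachedTokens = PROMPT_TOKENS * cachedShare;
  const freshTokens = PROMPT_TOKENS - cachedTokens + vacancyTokens;
  const input =
    (freshTokens * inputPricePerM + (cachedTokens * inputPricePerM) / CACHE_DIVISOR) / 1e6;
  const output = (outputTokens * outputPricePerM) / 1e6;
  return { input: input * BATCH_MULTIPLIER, output: output * BATCH_MULTIPLIER };
}

// PLACEHOLDER values for illustration only; replace with real usage and list prices.
const placeholder = {
  vacancyTokens: 600,
  outputTokens: 1500,
  inputPricePerM: 0.2,
  outputPricePerM: 1.25,
};

const cold = costPerCall({ ...placeholder, cachedShare: 0 });
const warm = costPerCall({ ...placeholder, cachedShare: 1 });
const outputShareCold = cold.output / (cold.input + cold.output);
const outputShareWarm = warm.output / (warm.input + warm.output);
```

With these placeholder inputs the output share of the bill moves from under half on a cold cache to well over three quarters on a warm one. The exact split depends entirely on the real prices and token counts, which is why the measured batch runs are the numbers to trust.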
- Statistical significance: 100 vacancies is direction-finding, not a confident point estimate per field.
- Edge cases at scale: no multi-language, malformed, or extremely long postings.
- End-to-end Minerva flow: only the LLM step. Downstream ISCO/Jobfeed enrichment is unchanged and untested against candidates.
11 models in total, in two waves plus follow-ups: an initial scout sweep (v1) → top-2 batch-measured rerun (v2) → two follow-up requests (gpt-5.4-nano high for Stefan, grok-build-0.1 for the xAI question). Throughout this document, the quality column for every model uses the most credible measurement available for it (batch-measured where it exists; otherwise the v2-relaxed score from the manual-review-corrected metric; otherwise v1 auto-score).
Ranked by primary decision criterion: measured / projected batch+cache cost-effectiveness combined with quality.
| # | Model | Route | Quality | Reliability | Latency | Cost batch+cache / 1K | 4 M 2024 projection | Status |
|---|---|---|---|---|---|---|---|---|
| 1 | gpt-5.4-nano (medium) | OpenAI direct | 87.7 % | 100/100 | 13 s sync | $1.54 (MEASURED) | ~$6.2 K | WINNER |
| 2 | gpt-5.4-nano (high) | OpenAI direct | 87.4 % | 100/100 | longer than medium | $1.84 (MEASURED) | ~$7.4 K | Dropped: same quality, +19 % cost |
| 3 | gemini-2.5-flash-lite + thinking | OpenRouter | 72.8 % | 100/100 after retry patch | 42 s sync | ~$1.45 (projected) | ~$5.8 K | Backup if cost beats quality |
| 4 | grok-build-0.1 | OpenRouter | 77.5 % | 100/100 | 53 s sync | $8–10 (projected); $26.50 sync MEASURED | $32–40 K | Dropped: 5× cost, −10 pp quality |
| 5 | gpt-5-nano (medium) | OpenAI direct | 76.9 % | 100/100 | 62 s sync | ~$1.40 (projected) | ~$5.6 K | Dropped: dominated by gpt-5.4-nano, ~5× slower |
| 6 | gemini-3.1-flash-lite | OpenRouter | 65.7 % | 19/20 | 2 s sync (fastest) | ~$1.10 (projected) | ~$4.4 K | Dropped: drops too many salaries (53 % match) |
| 7 | deepseek-v4-flash | OpenRouter | 58.6 % | 20/20 | 37 s sync | ~$0.80 (projected) | ~$3.2 K | Dropped: drops salaries, omits city/lang |
| — | qwen3-max-thinking | OpenRouter | n/a | 3/100 | timeouts at 180 s | — | — | Dropped: too slow / unreliable |
| — | qwen3.5-35b-a3b (MoE) | OpenRouter | n/a | 40/100 | timeouts at 180 s | — | — | Dropped: too slow / unreliable |
| — | grok-4.3 | xAI direct | ~grok-4-1 parity | 100 % | — | $6.40 (MEASURED batch) | ~$25.6 K | Workable only inside xAI's credit |
| — | grok-4-1-fast-reasoning (baseline) | xAI direct | (baseline) | — | — | $0.89 (MEASURED batch) | n/a — retired | The bar |
All scores are 0–100 %. Empty cell = field not evaluated for that model (e.g. qwen variants never completed enough calls).
| Field | weight | gpt-5.4-nano (medium) | gpt-5.4-nano (high) | gpt-5-nano (medium) | gemini-2.5-fl-lite + thinking | gemini-3.1-flash-lite | deepseek-v4-flash | grok-build-0.1 |
|---|---|---|---|---|---|---|---|---|
| lang | 1 | 100 % | 100 % | 100 % | 100 % | 100 % | 90 % | 87 % |
| jobtitle_ai | 2 | 83 % | 85 % | 52 % | 80 % | 78 % | 71 % | 85 % |
| exp_level | 2 | 74 % | 73 % | 60 % | 75 % | 84 % | 55 % | 81 % |
| exp_min | 1 | 71 % | 71 % | 70 % | 75 % | 74 % | 40 % | 81 % |
| exp_max | 1 | 97 % | 97 % | 95 % | 100 % | 89 % | 100 % | 98 % |
| lead | 1 | 100 % | 99 % | 95 % | 38 % | 0 % | 5 % | 80 % |
| country | 1 | 100 % | 100 % | 100 % | 13 % | 0 % | 25 % | 77 % |
| city | 1 | 90 % | 92 % | 90 % | 81 % | 79 % | 45 % | 90 % |
| province | 1 | 94 % | 93 % | 85 % | 81 % | 84 % | 50 % | 95 % |
| advertiser_type | 1 | 90 % | 91 % | 70 % | 88 % | 95 % | 80 % | 91 % |
| sal_min | 2 | 97 % | 94 % | 100 % | 94 % | 53 % | 60 % | 96 % |
| sal_max | 2 | 97 % | 96 % | 100 % | 94 % | 53 % | 60 % | 97 % |
| sal_period | 1 | 88 % | 90 % | 100 % | 100 % | 89 % | 55 % | 98 % |
| contract_type | 1 | 72 % | 73 % | 75 % | 19 % | 21 % | 20 % | 33 % |
| hours_min | 1 | 95 % | 93 % | 90 % | 88 % | 95 % | 60 % | 74 % |
| hours_max | 1 | 95 % | 93 % | 95 % | 88 % | 95 % | 85 % | 84 % |
| side_job | 1 | 63 % | 65 % | 45 % | 75 % | 95 % | 95 % | 64 % |
| pull_factors | 3 | 79 % | 79 % | 54 % | 59 % | 69 % | 67 % | 72 % |
| job_benefits | 1 | 88 % | 86 % | 60 % | 67 % | 36 % | 51 % | 63 % |
| education_level | 2 | 92 % | 93 % | 88 % | 91 % | 74 % | 78 % | 45 % |
| education_subject | 1 | 88 % | 85 % | 83 % | 75 % | 63 % | 50 % | 29 % |
| Weighted overall | | 87.7 % | 87.4 % | 76.9 % | 72.8 % | 65.7 % | 58.6 % | 77.5 % |
Reading the table:
- gpt-5.4-nano medium and high are statistically tied across every field (±3 pp scatter, no field where high is materially better). High costs +19 % for no measurable quality gain on an extraction task (see §5.2).
- gpt-5.4-nano dominates on numeric/categorical fields (salary, contract_type, education_*) and on the structural fields (`lead`, `country`) that OpenRouter-routed Gemini/DeepSeek drop entirely.
- gemini-2.5-flash-lite + thinking is the only non-OpenAI candidate that competes; weak on contract_type (19 %), education_subject (75 %), and on the structural fields (lead 38 %, country 13 %) because OpenRouter strips strict schema enforcement. Strong on `jobtitle_ai` (80 %) but only 59 % on the highest-weight field, `pull_factors`. It is the second-tier price candidate.
- gemini-3.1-flash-lite and deepseek-v4-flash both fail salary capture (sal_min/sal_max ~53–60 %), disqualifying for IG where salary extraction is non-negotiable.
- grok-build-0.1 looks superficially fine on quality (77.5 %; best on exp_min, strong on sal_min and sal_period) but craters on `contract_type` (33 %) and `education_*` (29–45 %), the same omission pattern as the Gemini routes, and is ~5× more expensive than gpt-5.4-nano even with optimistic batch+cache discounts.
Each entry below explains the trade-off in one short paragraph, in leaderboard order.
Quality 87.7 %. Cost MEASURED via OpenAI /v1/batches = $1.54 / 1 K calls. 4 M 2024 projection = ~$6.2 K.
At or near the top of the leaderboard on the numeric and categorical fields that downstream consumers (Minerva, ISCO, Jobfeed) actually read: salary (97 %), contract_type (72 %), education_level (92 %), education_subject (88 %), job_benefits (88 %). Native json_schema support on OpenAI's direct API gives reliable structural output at scale, which none of the OpenRouter-routed candidates can claim. Latency of 13 s sync is irrelevant on the batch path and acceptable if a real-time flow ever appears.
Quality 87.4 %. Cost MEASURED = $1.84 / 1 K calls (+19 % over medium).
Stefan asked specifically whether high reasoning would change the outcome. Tested on the same 100 vacancies through the same OpenAI Batch endpoint. Per-field deltas vs medium scatter ±2–3 pp in both directions with no field where high is materially better; overall −0.3 pp is well inside the noise floor for 100 vacancies. The extra reasoning budget shows up as +90 % output tokens (more chain-of-thought) but does not translate into measurable extraction quality, which is consistent with this being a structured-extraction task rather than a multi-step-reasoning task. Medium wins on cost-effectiveness with no quality trade-off.
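The per-field comparison described above can be reproduced from the per-field table. The sketch below copies both gpt-5.4-nano columns as published (rounded percentages) and computes high minus medium for each field; the variable names are illustrative only.

```javascript
// Field order and scores copied from the per-field table (rounded %, 21 fields).
const fields = [
  'lang', 'jobtitle_ai', 'exp_level', 'exp_min', 'exp_max', 'lead', 'country',
  'city', 'province', 'advertiser_type', 'sal_min', 'sal_max', 'sal_period',
  'contract_type', 'hours_min', 'hours_max', 'side_job', 'pull_factors',
  'job_benefits', 'education_level', 'education_subject',
];
const medium = [100, 83, 74, 71, 97, 100, 100, 90, 94, 90, 97, 97, 88, 72, 95, 95, 63, 79, 88, 92, 88];
const high = [100, 85, 73, 71, 97, 99, 100, 92, 93, 91, 94, 96, 90, 73, 93, 93, 65, 79, 86, 93, 85];

// Delta in percentage points, positive when high scores better than medium.
const deltas = fields.map((field, i) => ({ field, delta: high[i] - medium[i] }));

const maxAbsDelta = Math.max(...deltas.map((d) => Math.abs(d.delta)));
const highBetter = deltas.filter((d) => d.delta > 0).length;
const highWorse = deltas.filter((d) => d.delta < 0).length;
const tied = deltas.filter((d) => d.delta === 0).length;

console.log({ maxAbsDelta, highBetter, highWorse, tied });
```

On the published figures the largest gap in either direction is 3 pp, with high ahead on 7 fields, behind on 9, and level on 5, which is the pattern expected from noise rather than from a systematic benefit of the larger reasoning budget.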
Quality 72.8 %. Cost projected = ~$1.45 / 1 K calls.
Scores 80 % on jobtitle_ai and 59 % on pull_factors, two IG-distinctive fields (gpt-5.4-nano: 83 % and 79 %), and is ~6 % cheaper than gpt-5.4-nano in the batch+cache projection. Weak on contract_type (19 %), education_subject (75 %), and structural fields (lead 38 %, country 13 %) because OpenRouter→Gemini strips strict schema enforcement. Worth keeping warm as a fallback rather than primary: if Stefan finds gpt-5.4-nano's pull_factors output unsatisfactory on spot-checks, this is the cheapest credible alternative.
Quality 77.5 %. Cost sync MEASURED = $26.50 / 1 K, batch+cache projected = $8–10 / 1 K.
xAI's newest model (released after the original benchmark). "Stay on xAI" would have been operationally simpler (one fewer provider relationship), but quality lands 10 pp below gpt-5.4-nano with the same auxiliary-field-omission pattern as Gemini routes (contract_type 33 %, education_subject 29 %, education_level 45 %). Cost is the bigger problem: even with an aspirational batch+cache discount it is ~5× gpt-5.4-nano (medium) and ~9–11× the grok-4-1 baseline we used to pay; at the measured sync price it is ~30× that baseline. Not competitive on either axis.
Quality 76.9 %. Cost projected = ~$1.40 / 1 K calls.
Dominated by gpt-5.4-nano in practice: 10.8 pp lower quality at similar cost, and dramatically slower (~62 s/call sync, ~5× slower than gpt-5.4-nano). It matches or beats gpt-5.4-nano on the salary fields (sal_min/sal_max 100 %, sal_period 100 %) and ties on a few others (lang, country, city), but regresses sharply on jobtitle_ai (52 % vs 83 %), pull_factors (54 % vs 79 %), and job_benefits (60 % vs 88 %). Dropped before v2 manual review.
Quality 65.7 %. Cost projected = ~$1.10 / 1 K calls.
Fastest of the bunch (~2 s/call sync), but it loses too many salaries (sal_min/sal_max 53 %), which is disqualifying for an IG classifier where salary extraction is mandatory. Dropped.
Quality 58.6 %. Cost projected = ~$0.80 / 1 K calls (cheapest of all tested).
Cheapest listing price. Same salary-capture problem as gemini-3.1; additionally omits lang and city fields sometimes. Cost is attractive but quality is not acceptable for production. Dropped.
Quality not assessable. qwen3-max-thinking: 3/100 calls completed (81 timeouts at 180 s, rest empty content). qwen3.5-35b-a3b: 40/100 completed (58 timeouts). Both dropped before quality evaluation.
Quality ≈ grok-4-1 parity (same model family). Cost MEASURED in production batch = $6.40 / 1 K calls.
Used in production for the 2025 + 2026 H1 backfill against the credit xAI granted. At full price (no credit) it projects to ~$25.6 K for the 4 M 2024 backfill — does not fit the budget. Workable only inside the credit envelope.
Cost MEASURED on production traffic = $0.89 / 1 K calls. This is the bar replacements are judged against. Retired by xAI on 2026-05-15; not available for new work.
| Discount | Applied to | Typical size |
|---|---|---|
| Batch API (async) | All providers in the comparison | −50 % off sync listing |
| Prompt cache (OpenAI auto-cache after first ~50 calls) | OpenAI direct | input tokens billed at ~10× lower for the cached portion |
| Prompt cache (Gemini explicit cachedContentTokenCount) | Gemini direct/Vertex | input tokens billed at ~4× lower |
| Prompt cache (DeepSeek) | DeepSeek via OpenRouter | reflected in prompt_tokens_details.cached_tokens |
Our prompt is ~12 K input tokens shared across every call. At 4 M+ scale, output tokens dominate cost after cache warm-up — the input-token line approaches zero.
| Model | Cost basis | Cost / 1K | 4 M 2024 projection |
|---|---|---|---|
| gpt-5.4-nano (medium) | MEASURED batch+cache | $1.54 | ~$6.2 K |
| gpt-5.4-nano (high) | MEASURED batch+cache | $1.84 | ~$7.4 K |
| gemini-2.5-flash-lite + thinking | Projected batch+cache | ~$1.45 | ~$5.8 K |
| gpt-5-nano (medium) | Projected batch+cache | ~$1.40 | ~$5.6 K |
| gemini-3.1-flash-lite | Projected batch+cache | ~$1.10 | ~$4.4 K |
| deepseek-v4-flash | Projected batch+cache | ~$0.80 | ~$3.2 K |
| grok-build-0.1 | Projected batch+cache | $8–10 | $32–40 K |
| grok-4.3 | MEASURED batch | $6.40 | ~$25.6 K |
| grok-4-1-fast-reasoning (reference) | MEASURED batch | $0.89 | (retired) |
Variance band on projections: roughly ±20 %. Measured numbers (gpt-5.4-nano medium, gpt-5.4-nano high, grok-4.3, grok-4-1) are reliable; projected numbers should be read as "expect within ±20 %, measure before committing to a multi-thousand-dollar run."
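The projection column is plain arithmetic on the per-1K cost, and the ±20 % band can be applied the same way. A minimal sketch, with `projectBackfill` as a hypothetical helper name and the inputs taken from the table above:

```javascript
// 4 M vacancies in the 2024 backfill (from this document).
const BACKFILL_CALLS = 4_000_000;

// Scale a cost per 1 000 calls to a full run, with an optional relative variance band.
function projectBackfill(costPer1k, calls = BACKFILL_CALLS, band = 0) {
  const point = (costPer1k * calls) / 1000;
  return { low: point * (1 - band), point, high: point * (1 + band) };
}

// Measured number: no band applied.
const gptMedium = projectBackfill(1.54);
// Projected number: read with the ±20 % band.
const geminiBackup = projectBackfill(1.45, BACKFILL_CALLS, 0.2);

console.log(Math.round(gptMedium.point)); // prints 6160
console.log(Math.round(geminiBackup.low), Math.round(geminiBackup.high)); // prints 4640 6960
```

With the band applied, the projected gemini-2.5-flash-lite run spans roughly $4.6 K to $7.0 K, a range that overlaps the measured ~$6.2 K for gpt-5.4-nano (medium). The ~6 % saving therefore only counts once it has been measured on a real batch run.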
Primary: gpt-5.4-nano (reasoning_effort = medium) via OpenAI Batch API.
- Highest weighted quality (87.7 %), 10+ pp above the next scored candidate outside the gpt-5.4-nano pair (grok-build-0.1 at 77.5 %).
- 100 % reliability on the benchmark.
- Cost is MEASURED — $1.54 / 1 K calls = ~$6.2 K for 4 M 2024 — well inside the original ~$10 K project budget.
- Native `json_schema` support on OpenAI's direct API gives structural reliability at scale.
- The `high` tier provides no measurable quality gain at +19 % cost (§5.2). Medium is the right setting.
Backup if cost becomes the deciding factor: gemini-2.5-flash-lite + thinking (~6 % cheaper at 14.9 pp lower quality).
Eliminated, one line each:
- gpt-5.4-nano (high): identical quality to medium at +19 % cost
- gpt-5-nano (medium): dominated by gpt-5.4-nano on every axis
- gemini-3.1-flash-lite, deepseek-v4-flash: both lose too many salaries
- qwen3-max-thinking, qwen3.5-35b-a3b: timeouts, quality not assessable
- grok-build-0.1: −10 pp quality, ~5× cost
- grok-4.3: workable only inside the xAI credit; doesn't fit budget at full price
- Stefan, Sabine: spot-check gpt-5.4-nano on a handful of vacancies you would normally review by eye. Specifically: are gpt-5.4-nano's `pull_factors` outputs acceptable to you, even when they don't word-for-word match grok-4-1?
- On green light, port production `classify.js` to OpenAI's `/v1/batches` with `gpt-5.4-nano` + `reasoning_effort=medium` and run the 4 M 2024 backfill. Expected cost ~$6.2 K, wall-clock 1–3 days through the batch path.
- Fold the prompt patches (preserve Dutch jobtitle, keep nested schema) into production `prompt.js` so future runs benefit. No downstream consumer relies on the English-translated jobtitle form.
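For the port in the second step, the Batch API takes a JSONL input file with one request object per line. The sketch below builds those lines and makes no network calls. It is not the production `classify.js`: the helper names, the schema name `vacancy_classification`, and the choice of the `/v1/chat/completions` endpoint are assumptions, and `systemPrompt` and `outputSchema` stand in for the real contents of `prompt.js` and the production schema.

```javascript
// Build one Batch API request line for a single vacancy.
// Assumption: the chat completions endpoint is used; the bench may target a different one.
function buildBatchLine(vacancy, systemPrompt, outputSchema) {
  return JSON.stringify({
    custom_id: String(vacancy.job_id), // used to join results back to the vacancy
    method: 'POST',
    url: '/v1/chat/completions',
    body: {
      model: 'gpt-5.4-nano',
      reasoning_effort: 'medium',
      messages: [
        // Identical system prompt first, so the shared prefix stays cacheable.
        { role: 'system', content: systemPrompt },
        { role: 'user', content: vacancy.full_text },
      ],
      response_format: {
        type: 'json_schema',
        json_schema: { name: 'vacancy_classification', strict: true, schema: outputSchema },
      },
    },
  });
}

// Join all request lines into the JSONL body of a batch input file.
function buildBatchFile(vacancies, systemPrompt, outputSchema) {
  return vacancies.map((v) => buildBatchLine(v, systemPrompt, outputSchema)).join('\n') + '\n';
}

// Split a large run into chunks; the chunk size is a parameter, not a documented limit.
function chunk(items, size) {
  const chunks = [];
  for (let i = 0; i < items.length; i += size) chunks.push(items.slice(i, i + size));
  return chunks;
}
```

The Batch API caps both the number of requests and the file size per batch, so a 4 M run has to be split across many input files; with a ~12 K-token prompt repeated on every line, the file-size cap is likely to bind first. The current limits should be checked in the provider documentation before choosing a chunk size.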
- `bench/baseline.json`: the 100-vacancy ground-truth dataset
- `bench/build-baseline.js`: dataset selection script
- `bench/run.js`, `bench/providers.js`: synchronous benchmark runner
- `bench/openai-batch.js`: measured-cost runner via OpenAI Batch API (used for gpt-5.4-nano medium + high)
- `bench/evaluate.js`: 22-field weighted evaluator
- `bench/results/runs-openai-batch-2026-05-21T10-09-00.jsonl`: gpt-5.4-nano (medium) batch run
- `bench/results/runs-openai-batch-2026-05-21T14-51-57.jsonl`: gpt-5.4-nano (high) batch run
- `bench/results/runs-2026-06-10T07-05-22.jsonl`: grok-build-0.1 run
- `bench/results/spotcheck-*.md`: side-by-side baseline vs candidate output for every vacancy, per run
- 100 vacancies is direction-finding, not statistical proof. Per-field point estimates have ±5 pp confidence bands at this sample size. The headline ranking (gpt-5.4-nano > gemini-2.5 > rest) is robust; specific per-field deltas under 5 pp should not be read as significant.
- The auto-quality metric was relaxed once between v1 and v2 after manual review showed v1 over-penalised stylistic differences (e.g. "Senior Developer" vs "Senior Software Developer") and structural omissions (`is_jobboard` being absent vs being false). Numbers in this document use the v2 metric where the model was re-evaluated; v1-only models (deepseek, gemini-3.1, gpt-5-nano) keep their original v1 scores, which means they may be marginally under-rated. The qualitative ranking does not change either way.
- OpenRouter-routed candidates are at a structural disadvantage. Gemini and DeepSeek via OpenRouter cannot use json_schema strict mode, so they drop fields more readily. A production deployment of any of them would invest in few-shot prompting and strict-schema setup to close the gap; this bench did not. The OpenAI direct candidates do not have this issue.
- Cost numbers without "MEASURED" tag are projections. They assume 50 % batch discount and steady-state cache warm-up. Real numbers may vary ±20 %. The four measured cost points (gpt-5.4-nano medium, gpt-5.4-nano high, grok-4.3, grok-4-1) are the trustworthy anchors.
- End-to-end Minerva flow not retested. The bench only measures the LLM step. ISCO/Jobfeed downstream enrichment is unchanged across candidates so the comparison is valid, but the chosen winner should be re-validated end-to-end on a small production sample before the full 4 M 2024 run.