Project: Learco Personal, Project 1409 (Graphisoft CEO search)
Date: 2026-04-04
Candidates: 359 with full LinkedIn profiles
Scoring factors: 6, all on 0-5 scale (total max: 30)
Runs per experiment: 5
Concurrency: 32 parallel requests (10 for Anthropic due to latency)
Prompt: Identical across all experiments (exported via experiment:export-prompts using production CandidateScoringPromptService)
Dealroom enrichment: Skipped (to eliminate external API nondeterminism)
For each candidate × each factor × each run, the model returns a score (0-5) and reasoning.
We define a candidate as unstable on a factor if max(scores) - min(scores) >= 2 across 5 runs.
On a 0-5 scale, spread of 2 = 40% of the scale — this is not ±1 noise, it's a meaningful disagreement.
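The definition above can be sketched in a few lines. This is a minimal illustration, assuming the per-run scores for one factor have already been collected into a dict keyed by candidate ID; the `scores_by_candidate` layout and the function names are hypothetical and not taken from the experiment code.

```python
from typing import Dict, List

UNSTABLE_SPREAD = 2  # max - min >= 2 on the 0-5 scale, per the definition above


def is_unstable(scores: List[int], threshold: int = UNSTABLE_SPREAD) -> bool:
    """True if one candidate's scores on one factor spread by >= threshold."""
    return max(scores) - min(scores) >= threshold


def pct_unstable(scores_by_candidate: Dict[str, List[int]]) -> float:
    """Percentage of candidates whose scores on a single factor are unstable."""
    if not scores_by_candidate:
        return 0.0
    unstable = sum(is_unstable(s) for s in scores_by_candidate.values())
    return 100.0 * unstable / len(scores_by_candidate)


# Hypothetical data: three candidates, five runs each, one factor.
example = {
    "cand-1": [3, 3, 3, 3, 3],  # spread 0: stable
    "cand-2": [2, 3, 3, 2, 3],  # spread 1: stable (±1 noise)
    "cand-3": [1, 3, 2, 3, 1],  # spread 2: unstable
}
print(round(pct_unstable(example), 1))
```

Running the metric once per factor and per experiment yields one percentage per table cell in the results.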
| ID | Factor | Max | Type |
|---|---|---|---|
| 5968 | International Revenue Growth Track Record | 5 | CRITICAL |
| 5969 | Industry & Market Relevance | 5 | CRITICAL |
| 5970 | Seniority & Scope | 5 | SUPPORTING |
| 5971 | CEO Readiness & Subsidiary Leadership | 5 | SUPPORTING |
| 5972 | SaaS & Subscription Leadership | 5 | SUPPORTING |
| 5973 | Architecture Domain Affinity | 5 | SUPPORTING |
All experiments used the same prompt per candidate, built by PHP CandidateScoringPromptService::generateScoringPrompt(). Prompt includes:
- System instructions (role, JSON format, scoring rules)
- Search specification document (plain text, HTML stripped)
- Custom researcher note prompt (advanced format with company info)
- All 6 factor definitions with rubrics, edge case rules, and ZERO GUARD instruction
- Target company data (all data points from `company_data` table)
- Full candidate LinkedIn profile JSON (~50-150K chars)
- Accuracy check instruction referencing all data point names
- Today's date
Average prompt length: ~160,000 characters per candidate.
| Variable | Baseline | Exp A | Exp B | Exp C | Exp D |
|---|---|---|---|---|---|
| Model | grok-4-1-fast-non-reasoning | grok-4-1-fast-reasoning | grok-4-1-fast-reasoning | grok-4.20-0309-non-reasoning | claude-sonnet-4-6 |
| Provider | xAI | xAI | xAI | xAI | Anthropic |
| Prompt change | — | — | + deep analysis instruction | — | — |
| Output format | json_schema | json_schema | json_schema | json_schema | direct JSON (no tool_use) |
- Model: `grok-4-1-fast-non-reasoning` (current production scoring model)
- Provider: xAI
- API: `v1/chat/completions` with `response_format: json_schema`
- Cost: ~$0.011 per candidate ($0.20/M input, $0.50/M output)
- Speed: ~4-5 req/s at 32 concurrency
- Total time: ~8 min for 5 runs × 359 candidates
- Results saved: `experiments/results/experiment-baseline-41fast-nonreasoning/`
- Model: `grok-4-1-fast-reasoning` (xAI reasoning model)
- Provider: xAI
- API: `v1/chat/completions` with `response_format: json_schema`
- Cost: ~$0.011 per candidate ($0.20/M input, $0.50/M output)
- Speed: ~0.7-1.1 req/s at 32 concurrency
- Total time: ~30 min for 5 runs × 359 candidates
- Results saved: `experiments/results/experiment-a-reasoning-only/`
- Model: `grok-4-1-fast-reasoning` (same as Exp A)
- Provider: xAI
- Prompt modification: Added "Deep Profile Analysis Required" instruction after `# Instructions:`

  > ## CRITICAL: Deep Profile Analysis Required
  >
  > Before scoring ANY factor, you MUST perform a complete, thorough analysis of the ENTIRE candidate profile below. This means:
  >
  > 1. Read EVERY work experience entry — do not skip any, even if the profile is long
  > 2. Read ALL education entries, certifications, skills, and summary sections
  > 3. For each factor, trace evidence across the FULL career history
  > 4. Cross-reference job titles with company descriptions to understand actual scope
  > 5. Look for indirect signals
  >
  > Only after completing this full analysis should you begin scoring.
- Cost: ~$0.011 per candidate ($0.20/M input, $0.50/M output)
- Speed: ~1.2-1.4 req/s at 32 concurrency
- Total time: ~25 min for 5 runs × 359 candidates
- Results saved: `experiments/results/experiment-b-reasoning-deep-prompt/`
- Model: `grok-4.20-0309-non-reasoning` (xAI's newer, more capable non-reasoning model)
- Provider: xAI
- API: Same endpoint, same schema, same prompt as Baseline
- Cost: ~$0.011 per candidate ($0.20/M input, $0.50/M output; same as all xAI models)
- Speed: ~1.0-1.3 req/s at 32 concurrency
- Total time: ~25 min for 5 runs × 359 candidates
- Results saved: `experiments/results/experiment-c-grok420-non-reasoning/`
- Model: `claude-sonnet-4-6` (Anthropic)
- Provider: Anthropic
- API: `v1/messages`, direct JSON output (no tool_use, to avoid overhead)
- Cost: ~$0.17 per candidate ($3/M input, $15/M output)
- Speed: ~0.05-0.1 req/s at 10 concurrency (very slow: high per-request latency on ~160K-character prompts)
- Total time: ~6 hours for 5 runs × 359 candidates
- Results saved: `experiments/results/experiment-d-sonnet46/`
| Factor (% of candidates unstable) | 4.1-fast non-reas | 4.1-fast reasoning | 4.1-fast reas+deep | grok-4.20 non-reas | Sonnet 4.6 |
|---|---|---|---|---|---|
| International Revenue Growth | **3.9%** * | 6.7% | 7.0% | 26.2% | 9.5% |
| Industry & Market Relevance | **8.9%** * | 15.6% | 16.2% | 17.5% | 16.2% |
| Seniority & Scope | 1.4% | 0.6% | **0.3%** * | 0.8% | 9.5% |
| CEO Readiness & Subsidiary Leadership | 5.8% | 4.2% | 5.3% | **2.5%** * | 10.3% |
| SaaS & Subscription Leadership | **1.1%** * | 2.8% | 2.2% | 8.4% | 9.5% |
| Architecture Domain Affinity | 10.9% | 14.8% | **10.3%** * | 14.5% | 13.4% |

**Bold** * = best (lowest instability) for that factor.
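The "factors won" tally can be re-derived from the table directly. The sketch below copies the table values by hand; the short model labels and the `factors_won` helper are illustrative names, not part of the experiment tooling.

```python
# % unstable candidates per factor, copied from the table above.
# Column order: Baseline, Exp A, Exp B, Exp C, Exp D.
MODELS = ["baseline", "exp_a", "exp_b", "exp_c", "exp_d"]
INSTABILITY = {
    "International Revenue Growth": [3.9, 6.7, 7.0, 26.2, 9.5],
    "Industry & Market Relevance": [8.9, 15.6, 16.2, 17.5, 16.2],
    "Seniority & Scope": [1.4, 0.6, 0.3, 0.8, 9.5],
    "CEO Readiness & Subsidiary Leadership": [5.8, 4.2, 5.3, 2.5, 10.3],
    "SaaS & Subscription Leadership": [1.1, 2.8, 2.2, 8.4, 9.5],
    "Architecture Domain Affinity": [10.9, 14.8, 10.3, 14.5, 13.4],
}


def factors_won(table):
    """Count, per model, the factors on which it has the lowest instability."""
    wins = {model: 0 for model in MODELS}
    for values in table.values():
        best_index = min(range(len(values)), key=values.__getitem__)
        wins[MODELS[best_index]] += 1
    return wins


print(factors_won(INSTABILITY))
```

The result is 3 wins for the baseline, 2 for Exp B, 1 for Exp C, and none for Exp A or Exp D.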
All xAI models share the same pricing: $0.20/M input, $0.50/M output ≈ $0.011/candidate.
| Model | Cost/candidate | Throughput (32 conc.) | Time for 5×359 | Factors won |
|---|---|---|---|---|
| grok-4-1-fast-non-reasoning | $0.011 | ~4-5 req/s | ~8 min | 3 |
| grok-4-1-fast-reasoning | $0.011 | ~0.7-1.1 req/s | ~30 min | 0 |
| grok-4-1-fast-reasoning+deep | $0.011 | ~1.2-1.4 req/s | ~25 min | 2 |
| grok-4.20-0309-non-reasoning | $0.011 | ~1.0-1.3 req/s | ~25 min | 1 |
| claude-sonnet-4-6 | $0.170 | ~0.05-0.1 req/s (10 conc.) | ~6 hours | 0 |
xAI pricing: $0.20/M input, $0.50/M output (same for all grok models, per ProjectCostService.php).
Anthropic Sonnet 4.6 pricing: $3/M input, $15/M output.
Average prompt: ~49K input tokens, ~1.5K output tokens per call.
| Experiment | Runs × Candidates | Cost/call | Est. Cost |
|---|---|---|---|
| Baseline (4.1-fast-non-reas) | 5 × 359 | $0.011 | ~$19 |
| Exp A (4.1-fast-reasoning) | 5 × 359 | $0.011 | ~$19 |
| Exp B (4.1-fast-reas+deep) | 5 × 359 | $0.011 | ~$19 |
| Exp C (grok-4.20-non-reas) | 5 × 359 | $0.011 | ~$19 |
| Exp D (Sonnet 4.6) | 5 × 359 | $0.170 | ~$306 |
| Total experiment | 8,975 calls | | ~$382 |
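The per-call figures follow from the quoted token counts and prices. A minimal sketch of that arithmetic, using only numbers stated in this report (the constant and function names are illustrative):

```python
RUNS = 5
CANDIDATES = 359
INPUT_TOKENS = 49_000   # average input tokens per call, as stated above
OUTPUT_TOKENS = 1_500   # average output tokens per call, as stated above

# USD per million tokens (input, output), as quoted above.
PRICING = {
    "xai": (0.20, 0.50),
    "anthropic": (3.00, 15.00),
}


def cost_per_call(provider: str) -> float:
    """Average cost in USD of scoring one candidate once."""
    price_in, price_out = PRICING[provider]
    return (INPUT_TOKENS * price_in + OUTPUT_TOKENS * price_out) / 1_000_000


def experiment_cost(provider: str) -> float:
    """Cost in USD of one experiment: RUNS x CANDIDATES calls."""
    return cost_per_call(provider) * RUNS * CANDIDATES


for provider in PRICING:
    print(provider, round(cost_per_call(provider), 4), round(experiment_cost(provider), 2))
```

This gives roughly $0.0106 per call for xAI and $0.1695 for Sonnet 4.6, or about $19 and $304 per experiment, close to the rounded figures in the table.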
grok-4-1-fast-non-reasoning wins on 3 of 6 factors and is competitive on the other 3. It is:
- 15x cheaper than Sonnet 4.6 ($0.011 vs $0.17), and the same price as the other xAI models
- 50-100x faster than Sonnet 4.6
- 3-5x faster than other xAI models
This is counter-intuitive. You would expect a more capable model or a reasoning model to be more consistent. The opposite is true.
Comparing grok-4-1-fast-non-reasoning vs grok-4-1-fast-reasoning:
- Industry & Market Relevance: 8.9% → 15.6% (nearly doubled, worse)
- Architecture Domain Affinity: 10.9% → 14.8% (worse)
- International Revenue Growth: 3.9% → 6.7% (worse)
- CEO Readiness: 5.8% → 4.2% (slightly better)
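Reasoning costs the same per candidate, runs 3-5x slower, and makes 4 of 6 factors less stable (SaaS & Subscription also worsens, 1.1% → 2.8%; only CEO Readiness and Seniority & Scope improve).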
Comparing grok-4-1-fast-reasoning vs grok-4-1-fast-reasoning+deep:
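- Numbers are within ±2 percentage points on 5 of 6 factors; the exception is Architecture Domain Affinity (14.8% → 10.3%)
- No systematic improvement: three factors get slightly better and three get slightly worse
- The prompt instruction to "read the entire profile" had no consistent measurable effect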
grok-4.20-0309-non-reasoning is dramatically worse on key factors:
- International Revenue Growth: 3.9% → 26.2% (6.7x worse than baseline)
- SaaS & Subscription: 1.1% → 8.4% (7.6x worse)
- Industry & Market: 8.9% → 17.5% (2x worse)
Only CEO Readiness (5.8% → 2.5%) and Seniority & Scope (1.4% → 0.8%) improved.
Sonnet 4.6 is unstable on every factor — 9.5% to 16.2%. It wins on zero factors. Specific results:
- Seniority & Scope: 9.5% vs 1.4% baseline (6.8x worse — a factor that all xAI models handle well)
- CEO Readiness: 10.3% vs 5.8% baseline (1.8x worse)
- SaaS & Subscription: 9.5% vs 1.1% baseline (8.6x worse)
Additionally, Sonnet 4.6 is extremely slow on 160K char prompts: ~0.05-0.1 req/s effective throughput, making a 5-run experiment take ~6 hours vs ~8 minutes for baseline.
Across all 5 experiments, these two factors show consistently high instability:
- Industry & Market Relevance: 8.9%-17.5% unstable (never below 8.9%)
- Architecture Domain Affinity: 10.3%-14.8% unstable (never below 10.3%)
These factors require subjective interpretation of how relevant a career history is to the AEC/architecture domain. The model's assessment varies between runs because the evidence is ambiguous for roughly 10-15% of candidates. None of the models or prompt changes tested fixes this.
On grok-4-1-fast-non-reasoning:
- Seniority & Scope: 1.4% unstable
- SaaS & Subscription Leadership: 1.1% unstable
- International Revenue Growth: 3.9% unstable
- CEO Readiness: 5.8% unstable
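International Revenue Growth and SaaS & Subscription Leadership degrade on every other model, and all four factors degrade on Sonnet 4.6 (Seniority & Scope and CEO Readiness stay comparable or improve on the other xAI models). This suggests the factor definitions are sound: the added instability elsewhere is a model problem, not a factor problem.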
Hypothesis: Less "thinking" = less variance on borderline cases.
A non-reasoning model maps input → output more deterministically. It pattern-matches against the scoring rubric without deliberation. When a candidate's profile is borderline on a factor (e.g., "some AEC exposure but indirect"), the fast model consistently picks the same bucket.
A reasoning model deliberates. On each run, the chain of thought may explore different aspects of the profile, reach different intermediate conclusions, and therefore arrive at different scores. More reasoning = more paths = more variance.
Hypothesis: Larger models have wider output distributions.
grok-4.20 is a more capable model — it sees more nuance, considers more angles, and has a richer internal representation. This is great for open-ended tasks but harmful for consistency on structured scoring. When the model "understands more," it also "second-guesses more."
The International Revenue Growth factor is the clearest example: it requires specific numerical evidence (25%+ growth, €30M+ revenue, cold market entry). grok-4.20 apparently interprets "cold market entry" more broadly on some runs than others, leading to 26.2% instability.
Hypothesis: Different architecture + no native json_schema enforcement.
Two factors likely contribute:
- No json_schema mode. xAI models use `response_format: json_schema`, which constrains output to the exact schema. Sonnet receives a text prompt asking for JSON, which leaves more degrees of freedom in how it structures the response and may affect how it allocates attention to different factors.
- Different training distribution. Sonnet is optimized for general-purpose tasks. xAI's grok models may be better calibrated for structured scoring due to different fine-tuning priorities.
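To make the first point concrete, the sketch below contrasts a schema-constrained request with the manual validation that a free-text JSON response needs. Everything here is an assumption for illustration: the production schema is not shown in this report, the field names are hypothetical, and the `response_format` shape follows the OpenAI-compatible convention, which may differ from what the scoring service actually sends.

```python
import json

FACTOR_IDS = [5968, 5969, 5970, 5971, 5972, 5973]  # IDs from the factor table

# Hypothetical schema: one entry per factor with a 0-5 score and reasoning.
SCORING_SCHEMA = {
    "type": "object",
    "properties": {
        "scores": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "factor_id": {"type": "integer", "enum": FACTOR_IDS},
                    "score": {"type": "integer", "minimum": 0, "maximum": 5},
                    "reasoning": {"type": "string"},
                },
                "required": ["factor_id", "score", "reasoning"],
                "additionalProperties": False,
            },
        }
    },
    "required": ["scores"],
    "additionalProperties": False,
}


def build_response_format(schema: dict) -> dict:
    """Assumed OpenAI-compatible shape for schema-constrained output."""
    return {
        "type": "json_schema",
        "json_schema": {"name": "candidate_scores", "schema": schema, "strict": True},
    }


def validate_scores(raw_text: str) -> dict:
    """Manual check for providers that return unconstrained JSON text."""
    data = json.loads(raw_text)
    result = {}
    for item in data["scores"]:
        factor_id, score = item["factor_id"], item["score"]
        if factor_id not in FACTOR_IDS:
            raise ValueError(f"unknown factor id: {factor_id}")
        if isinstance(score, bool) or not isinstance(score, int) or not 0 <= score <= 5:
            raise ValueError(f"score out of range for factor {factor_id}: {score!r}")
        result[factor_id] = score
    missing = set(FACTOR_IDS) - set(result)
    if missing:
        raise ValueError(f"missing factors: {sorted(missing)}")
    return result
```

With schema enforcement the provider rejects malformed structures before they reach the caller; without it, a check like `validate_scores` has to run on every response.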
Hypothesis: The model already reads the profile; the problem is interpretation, not reading.
The "lazy zero" pattern from earlier experiments (model says "insufficient information" when data is present) was largely fixed by the Score 0 redefinition. The remaining instability is not about missing data — it's about how the model weighs ambiguous evidence. Telling it to "read more carefully" doesn't change how it interprets what it finds.
- Stick with `grok-4-1-fast-non-reasoning` for production scoring. It is the fastest model tested and the most stable overall (best on 3 of 6 factors). All xAI models cost the same ($0.011/candidate), so speed is the differentiator, and fast-non-reasoning is 3-5x faster.
- Do not switch to reasoning models for scoring. Same cost, 3-5x slower, and less consistent on 4 of 6 factors. Reasoning helps on open-ended tasks, not on structured scoring with clear rubrics.
- Do not switch to grok-4.20. Despite being "smarter," it is less stable on 4 of 6 factors, dramatically so on International Revenue Growth (3.9% → 26.2% unstable).
- Do not use Anthropic models for scoring. Sonnet 4.6 is the slowest model tested (50-100x slower than baseline), 15x more expensive ($0.17 vs $0.011), and wins zero factors on stability. This may change with future model versions or native json_schema support.
- Two factors (Industry & Market, Architecture Domain) had at least ~9-10% unstable candidates in every experiment. This appears inherent to the ambiguity of matching career history to domain expertise; no model or prompt tested fixes it. Options:
  - Accept it and flag borderline candidates for human review
  - Run scoring 3 times and take the median (eliminates outliers, 3x cost, still only ~$0.03/candidate)
  - Rework factor criteria to reduce subjective interpretation (may reduce scoring quality)
- Prompt engineering has minimal impact on stability. The "deep analysis" instruction and Score 0 redefinition did not produce meaningful improvements once measured with the correct metric (% unstable candidates).
- The correct metric is % unstable candidates, not "flipper rate" or "mean spread". Mean spread masks individual outliers. Flipper rate (0 vs non-0) is too narrow. Percentage of candidates with spread ≥ 40% of the scale captures what actually matters: how many candidates get unreliable scores.
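The first two options for the unstable factors (flag for human review, take the median of 3 runs) can be combined in one aggregation step. A minimal sketch, assuming each run is a dict mapping factor ID to score; the `aggregate_runs` name and the data layout are hypothetical:

```python
from statistics import median
from typing import Dict, List, Tuple


def aggregate_runs(
    runs: List[Dict[int, int]], review_spread: int = 2
) -> Tuple[Dict[int, int], List[int]]:
    """Return per-factor median scores and the factors needing human review.

    A factor is flagged when max - min across runs reaches review_spread,
    the same threshold used for the instability metric.
    """
    if len(runs) % 2 == 0:
        raise ValueError("use an odd number of runs so the median is an observed score")
    medians: Dict[int, int] = {}
    needs_review: List[int] = []
    for factor_id in runs[0]:
        scores = [run[factor_id] for run in runs]
        medians[factor_id] = median(scores)
        if max(scores) - min(scores) >= review_spread:
            needs_review.append(factor_id)
    return medians, needs_review


# Hypothetical candidate scored three times on two factors.
runs = [
    {5968: 3, 5969: 1},
    {5968: 3, 5969: 4},
    {5968: 4, 5969: 2},
]
medians, needs_review = aggregate_runs(runs)
print(medians, needs_review)
```

In this example factor 5968 settles at 3 without review, while factor 5969 gets a median of 2 and is flagged because its scores span 1 to 4.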
| Path | Description |
|---|---|
| `results/experiment-baseline-41fast-nonreasoning/` | Baseline: 5 runs, grok-4-1-fast-non-reasoning |
| `results/experiment-a-reasoning-only/` | Exp A: 5 runs, grok-4-1-fast-reasoning |
| `results/experiment-b-reasoning-deep-prompt/` | Exp B: 5 runs, grok-4-1-fast-reasoning + deep prompt |
| `results/experiment-c-grok420-non-reasoning/` | Exp C: 5 runs, grok-4.20-0309-non-reasoning |
| `results/experiment-d-sonnet46/` | Exp D: 5 runs, claude-sonnet-4-6 |
| `results/experiment-report-iteration-1.md` | Earlier report on Score 0 redefinition |
| `data/project-context.json` | Current factor definitions (with ZERO GUARD) |
| `data/project-context-original.json` | Original factor definitions (backup) |