- Ridge acceptance rate: 57% vs baseline 42% (late stage). The adaptive scorer IS better at finding useful problems — 22% fewer wasted generations (589 vs 758 attempts for 320 useful groups).
- The per-level breakdown shows baseline wins at every level, but the adaptive scorer was handicapped by a KL explosion at step 19 (kl=10.2, grad_norm=4.66).
- Eval accuracy: baseline 82.2% > ridge 78.7% > cluster 77.6%. Better exploration didn't translate to better training.
- kNN mode crashed due to argpartition bug (now fixed).
- Cluster Thompson sampling collapsed to 2/14 clusters (Algebra/easy and Algebra/hard). Only 47 out of 320 trained groups got cluster updates logged. The prior for unseen clusters (random*0.25) can't compete with observed variance once any cluster gets a few samples.
- The scorer replaces the tracker entirely. The adaptive scorer controls BOTH exploration (finding new problems) AND exploitation (replaying known-good problems). It should only control exploration — exploitation should use per-problem observations.
- Thompson sampling collapses with rejection sampling. Rejection discards zero-variance groups, which means clusters that produce easy/impossible problems never get `update()` calls, so Thompson never learns about them. Only clusters that produce frontier problems accumulate stats.
- Cold start dominates. With 20 steps × 16 groups, the ridge scorer observes only 238/7500 problems (3.2%). Fitting a 4096-dimensional ridge regression on 238 points is underdetermined. The benefit of better exploration never overcomes the early-step waste.
- Eval wasn't stratified. The regular eval overweights easy problems, where all scorers perform similarly. The cluster/adaptive approaches should shine on harder problems that require better training diversity.
- explore_frac=10% is too low for 14 clusters. With 16 groups/step, that's ~1.6 explore groups per step — fewer than one explore visit per cluster every ~9 steps. Clusters barely get sampled during exploration.
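The first two findings compound: rejection filters zero-variance groups before any update, so Thompson sampling structurally cannot learn about non-frontier clusters. The mechanism can be reproduced with a standalone simulation (not the project's scorer code; the Gaussian-Thompson formulation, cluster count, and variance values are illustrative):

```python
import random

random.seed(0)

N_CLUSTERS = 14
FRONTIER = {0, 1}  # only these clusters produce nonzero-variance groups

# Per-cluster observation lists; empty = never updated (the collapse condition).
obs = {c: [] for c in range(N_CLUSTERS)}

def thompson_pick():
    """Sample a score per cluster; unseen clusters use the weak random*0.25 prior."""
    scores = {}
    for c in range(N_CLUSTERS):
        if obs[c]:
            mu = sum(obs[c]) / len(obs[c])
            scores[c] = random.gauss(mu, 0.1)
        else:
            scores[c] = random.random() * 0.25  # prior for unseen clusters
    return max(scores, key=scores.get)

for step in range(200):
    c = thompson_pick()
    variance = 0.4 if c in FRONTIER else 0.0
    # Rejection sampling: zero-variance groups are discarded BEFORE update(),
    # so non-frontier clusters never accumulate stats.
    if variance > 0:
        obs[c].append(variance)

updated = sorted(c for c in obs if obs[c])
print("clusters with updates:", updated)  # only frontier clusters survive
```

Once any frontier cluster has a few observations, its sampled score (~0.4) dominates the capped prior (<0.25), and the remaining 12 clusters are starved permanently — the same 2/14 collapse seen in the run.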
The key insight from the user: the adaptive scorer should only control exploration. Exploitation (replay) should use per-problem stats.
┌─────────────────┐
│ HybridScorer │
│ │
│ select(n, step) │
│ │ │
│ ┌───┴───┐ │
│ │ │ │
│ ▼ ▼ │
│ explore exploit │
│ (cluster (tracker │
│ or per- │
│ ridge) problem) │
└─────────────────┘
The HybridScorer:
- Exploration (explore_frac of groups): Use cluster-level Thompson or ridge regression to pick NEW problems likely to have variance
- Exploitation (rest): Use per-problem tracker (EMA reward variance + UCB) to repick known-good problems
- `update()` feeds BOTH subsystems — the tracker gets per-problem stats, the cluster/ridge scorer gets aggregate data
- Critically, `update()` should be called for ALL generated groups, not just accepted ones — zero-variance feedback is signal too
Currently the generator only calls scorer.update() for groups that pass the variance filter. Zero-variance groups (all-correct or all-wrong) get silently discarded. This means:
- The tracker doesn't learn "this problem is too easy/hard"
- Thompson sampling never learns about low-variance clusters
- Ridge regression misses ~50% of its training data
Fix: Call scorer.update() for ALL generated groups, even rejected ones. The scorer should see the full picture.
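A minimal sketch of where the fix lands in the generation loop (function and variable names are assumptions, not the project's actual code; the only point is calling `update()` before the variance filter rather than after it):

```python
def collect_groups(scorer, generate_group, n_groups, step, min_variance=1e-6):
    """Generate groups, keep only those with reward variance, but let the
    scorer observe every attempt — including rejected zero-variance groups."""
    accepted = []
    for problem_id in scorer.select(n_groups, step):
        rewards = generate_group(problem_id)

        # The fix: update BEFORE the variance filter, so "too easy / too hard"
        # outcomes are still signal for the tracker, Thompson, and ridge.
        scorer.update(problem_id, rewards, step)

        mean = sum(rewards) / len(rewards)
        variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
        if variance > min_variance:  # previously, update() only happened here
            accepted.append((problem_id, rewards))
    return accepted
```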
Goal: Establish fair baselines with stratified eval and the update-on-rejection fix.
Changes:
- `--stratified_eval` on all runs
- `--priority_explore_frac 0.25` (up from 0.10)
- Fix: call `scorer.update()` for rejected groups too
- Fix: kNN argpartition bug (already fixed)
Runs (4 runs, ~6 hours):
- `--scorer tracker` (baseline)
- `--scorer cluster --cluster_key type_level` (cluster, should benefit from the update fix)
- `--scorer adaptive --adaptive_mode ridge` (ridge)
- `--scorer adaptive --adaptive_mode knn` (kNN, now with the bug fix)
Hypothesis: Cluster should improve significantly with update-on-rejection fix, because Thompson sampling will properly learn about all clusters instead of collapsing.
Goal: Test the hybrid architecture where exploration uses cluster/ridge and exploitation uses per-problem tracker.
Implementation:
- New `HybridScorer` class that wraps both a `TrackerScorer` and a `ClusterScorer` (or `AdaptiveScorer`)
- `select(n, step, exclude_ids)`:
  - `n_explore = int(n * explore_frac)` → delegated to the cluster/adaptive scorer
  - `n_exploit = n - n_explore` → delegated to the tracker scorer
  - Both receive `exclude_ids` for dedup
- `update(problem_id, rewards, step)`:
  - Forwards to BOTH sub-scorers
  - Called for ALL groups (accepted and rejected)
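The interface above can be sketched as follows (sub-scorer method signatures are assumptions based on the description; real ones may differ):

```python
class HybridScorer:
    """Exploration via a cluster/adaptive scorer, exploitation via the
    per-problem tracker. Sketch only: sub-scorer APIs are assumed."""

    def __init__(self, explore_scorer, tracker_scorer, explore_frac=0.25):
        self.explore_scorer = explore_scorer  # ClusterScorer or AdaptiveScorer
        self.tracker = tracker_scorer         # per-problem EMA variance + UCB
        self.explore_frac = explore_frac

    def select(self, n, step, exclude_ids=()):
        n_explore = int(n * self.explore_frac)
        n_exploit = n - n_explore
        exclude = set(exclude_ids)

        # Exploration: NEW problems the cluster/ridge model predicts have variance.
        explored = self.explore_scorer.select(n_explore, step, exclude)
        exclude |= set(explored)  # dedup: exploit picks must not repeat these

        # Exploitation: replay known-good problems from per-problem stats.
        exploited = self.tracker.select(n_exploit, step, exclude)
        return list(explored) + list(exploited)

    def update(self, problem_id, rewards, step):
        # Both subsystems see every group, accepted or rejected.
        self.tracker.update(problem_id, rewards, step)
        self.explore_scorer.update(problem_id, rewards, step)
```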
Runs (4 runs, ~6 hours):
- `--scorer tracker` (baseline, same as Exp 1)
- `--scorer hybrid --hybrid_explore cluster --cluster_key type_level --priority_explore_frac 0.25`
- `--scorer hybrid --hybrid_explore ridge --priority_explore_frac 0.25`
- `--scorer hybrid --hybrid_explore ridge --priority_explore_frac 0.50` (aggressive explore)
Hypothesis: Hybrid should match or beat baseline on exploitation (same tracker for replay), while getting better exploration (fewer wasted generations on unknown problems).
Goal: Test whether adaptive/hybrid approaches improve with more data (they should — tracker scales linearly, adaptive generalizes).
Same 4 configs as Experiment 2, but with --total_optim_steps 40 --eval_every_optim_steps 10.
Hypothesis: Adaptive/hybrid should close the gap or overtake baseline at higher step counts, because the ridge model gets better predictions with more observations.
Primary: Stratified eval accuracy at each checkpoint.
Secondary (exploration efficiency):
- Acceptance rate of fresh groups (should be higher for adaptive/hybrid)
- Unique clusters sampled (should be more diverse for cluster/hybrid)
- Number of generation attempts per useful group (lower = better)
- Time-to-buffer-full (how quickly the buffer fills with good groups)
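The second and third metrics fall out of the attempt counts already quoted in the findings (a quick sanity check on the Exp-0 numbers; note these are overall rates, while the 57% figure above is late-stage only):

```python
# Attempts needed to land 320 useful (nonzero-variance) groups, from Exp 0.
useful = 320
attempts = {"ridge": 589, "tracker": 758}

for name, n in attempts.items():
    acceptance = useful / n  # share of fresh groups that passed the filter
    per_group = n / useful   # generation attempts per useful group
    print(f"{name}: acceptance={acceptance:.1%}, attempts/group={per_group:.2f}")

saved = 1 - attempts["ridge"] / attempts["tracker"]
print(f"ridge wastes {saved:.0%} fewer generations")  # ~22%
```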
Diagnostic:
- Per-cluster update counts (should be more balanced with update-on-rejection fix)
- Ridge/kNN prediction accuracy (correlation of predicted vs actual variance)
- KL divergence and grad norm stability
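The prediction-accuracy diagnostic is a single correlation per eval step (sketch; `predicted` and `actual` stand for the ridge/kNN variance predictions and the realized group reward variances, names assumed):

```python
import numpy as np

def prediction_accuracy(predicted, actual):
    """Pearson r between predicted and realized group reward variance.
    A value near the probe study's r=0.058 would mean the scorer is
    effectively learning noise from the embeddings."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    if len(predicted) < 2 or predicted.std() == 0 or actual.std() == 0:
        return float("nan")  # correlation undefined for constant inputs
    return float(np.corrcoef(predicted, actual)[0, 1])
```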
- Update-on-rejection fix (30 min) — biggest bang for buck, fixes Thompson collapse
- Stratified eval default (5 min) — fair comparison
- HybridScorer (2 hours) — the core architectural change
- explore_frac=0.25 (CLI flag, already exists)
- Run experiments (~6 hours per experiment)
Total implementation time: ~3 hours before first experiment launch.
- Experiment 1 (fix + re-run): Low risk, high value. Even if results don't change, we'll know the fixes work.
- Experiment 2 (hybrid): Medium risk. The hybrid architecture is sound in theory, but the exploration benefit may not overcome the 3.2% coverage problem in 20 steps.
- Experiment 3 (longer training): Medium risk. 40 steps doubles training time but may show the crossover point where adaptive starts winning.
Biggest risk: The fundamental problem may be that 7500 problems with 4096-dim embeddings is too sparse for any online learning to work in 20-40 steps. If ridge/kNN can't predict variance from embeddings (and our probe study showed r=0.058), then the adaptive scorer is trying to learn an unlearnable function.
Counter-argument: The probe study used FIXED ground-truth variance from historical runs. The adaptive scorer learns from ONLINE observations where the model is changing. Early-training variance patterns may be more predictable than cross-run averages. Also, the probe used the model's own hidden states — the adaptive scorer uses pre-computed task embeddings, which capture task structure rather than model uncertainty.