- Ridge acceptance rate: 57% vs baseline 42% (late stage). The adaptive scorer IS better at finding useful problems — 22% fewer wasted generations (589 vs 758 attempts for 320 useful groups).
- The per-level breakdown shows baseline wins at every level, but the adaptive scorer was handicapped by a KL explosion at step 19 (kl=10.2, grad_norm=4.66).
- Eval accuracy: baseline 82.2% > ridge 78.7% > cluster 77.6%. Better exploration didn't translate to better training.
- kNN mode crashed due to argpartition bug (now fixed).
- Cluster Thompson sampling collapsed to 2/14 clusters (Algebra/easy and Algebra/hard). Only 47 out of 320 trained groups got cluster updates logged. The prior for unseen clusters (random*0.25) can't compete with observed variance once any cluster gets a few samples.
- The scorer replaces the tracker entirely. The adaptive scorer controls BOTH exploration (finding new problems) AND exploitation (replaying known-good problems). It should only control exploration — exploitation should use per-problem observations.
- Thompson sampling collapses with rejection sampling. Rejection discards zero-variance groups, which means clusters that produce easy/impossible problems never get `update()` calls, so Thompson never learns about them. Only clusters that produce frontier problems accumulate stats.
- Cold start dominates. With 20 steps × 16 groups, the ridge scorer observes only 238/7500 problems (3.2%). Fitting a 4096-dimensional ridge regression on 238 points is underdetermined. The benefit of better exploration never overcomes the early-step waste.
- Eval wasn't stratified. The regular eval overweights easy problems, where all scorers perform similarly. The cluster/adaptive approaches should shine on harder problems that require better training diversity.
- explore_frac=10% is too low for 14 clusters. With 16 groups/step, that's ~1.6 explore groups per step — fewer than one explore visit per cluster every ~9 steps. Clusters barely get sampled during exploration.
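The first two findings compound: rejection filters zero-variance groups before any update, so Thompson sampling structurally cannot learn about non-frontier clusters. The mechanism can be reproduced with a standalone simulation (not the project's scorer code; the Gaussian-Thompson formulation, cluster count, and variance values are illustrative):

```python
import random

random.seed(0)

N_CLUSTERS = 14
FRONTIER = {0, 1}  # only these clusters produce nonzero-variance groups

# Per-cluster observation lists; empty = never updated (the collapse condition).
obs = {c: [] for c in range(N_CLUSTERS)}

def thompson_pick():
    """Sample a score per cluster; unseen clusters use the weak random*0.25 prior."""
    scores = {}
    for c in range(N_CLUSTERS):
        if obs[c]:
            mu = sum(obs[c]) / len(obs[c])
            scores[c] = random.gauss(mu, 0.1)
        else:
            scores[c] = random.random() * 0.25  # prior for unseen clusters
    return max(scores, key=scores.get)

for step in range(200):
    c = thompson_pick()
    variance = 0.4 if c in FRONTIER else 0.0
    # Rejection sampling: zero-variance groups are discarded BEFORE update(),
    # so non-frontier clusters never accumulate stats.
    if variance > 0:
        obs[c].append(variance)

updated = sorted(c for c in obs if obs[c])
print("clusters with updates:", updated)  # only frontier clusters survive
```

Once any frontier cluster has a few observations, its sampled score (~0.4) dominates the capped prior (<0.25), and the remaining 12 clusters are starved permanently — the same 2/14 collapse seen in the run.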
The key insight from the user: the adaptive scorer should only control exploration. Exploitation (replay) should use per-problem stats.
┌─────────────────┐
│ HybridScorer │
│ │
│ select(n, step) │
│ │ │
│ ┌───┴───┐ │
│ │ │ │
│ ▼ ▼ │
│ explore exploit │
│ (cluster (tracker │
│ or per- │
│ ridge) problem) │
└─────────────────┘
The HybridScorer:
- Exploration (explore_frac of groups): Use cluster-level Thompson or ridge regression to pick NEW problems likely to have variance
- Exploitation (rest): Use per-problem tracker (EMA reward variance + UCB) to repick known-good problems
- `update()` feeds BOTH subsystems — the tracker gets per-problem stats, the cluster/ridge scorer gets aggregate data
- Critically, `update()` should be called for ALL generated groups, not just accepted ones — zero-variance feedback is signal too
Currently the generator only calls scorer.update() for groups that pass the variance filter. Zero-variance groups (all-correct or all-wrong) get silently discarded. This means:
- The tracker doesn't learn "this problem is too easy/hard"
- Thompson sampling never learns about low-variance clusters
- Ridge regression misses ~50% of its training data
Fix: Call scorer.update() for ALL generated groups, even rejected ones. The scorer should see the full picture.
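A minimal sketch of where the fix lands in the generation loop (function and variable names are assumptions, not the project's actual code; the only point is calling `update()` before the variance filter rather than after it):

```python
def collect_groups(scorer, generate_group, n_groups, step, min_variance=1e-6):
    """Generate groups, keep only those with reward variance, but let the
    scorer observe every attempt — including rejected zero-variance groups."""
    accepted = []
    for problem_id in scorer.select(n_groups, step):
        rewards = generate_group(problem_id)

        # The fix: update BEFORE the variance filter, so "too easy / too hard"
        # outcomes are still signal for the tracker, Thompson, and ridge.
        scorer.update(problem_id, rewards, step)

        mean = sum(rewards) / len(rewards)
        variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
        if variance > min_variance:  # previously, update() only happened here
            accepted.append((problem_id, rewards))
    return accepted
```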
Goal: Establish fair baselines with stratified eval and the update-on-rejection fix.
Changes:
- `--stratified_eval` on all runs
- `--priority_explore_frac 0.25` (up from 0.10)
- Fix: call `scorer.update()` for rejected groups too
- Fix: kNN argpartition bug (already fixed)
Runs (4 runs, ~6 hours):
- `--scorer tracker` (baseline)
- `--scorer cluster --cluster_key type_level` (cluster, should benefit from the update fix)
- `--scorer adaptive --adaptive_mode ridge` (ridge)
- `--scorer adaptive --adaptive_mode knn` (kNN, now with the bug fix)
Hypothesis: Cluster should improve significantly with update-on-rejection fix, because Thompson sampling will properly learn about all clusters instead of collapsing.
Goal: Test the hybrid architecture where exploration uses cluster/ridge and exploitation uses per-problem tracker.
Implementation:
- New `HybridScorer` class that wraps both a `TrackerScorer` and a `ClusterScorer` (or `AdaptiveScorer`)
- `select(n, step, exclude_ids)`:
  - `n_explore = int(n * explore_frac)` → delegated to the cluster/adaptive scorer
  - `n_exploit = n - n_explore` → delegated to the tracker scorer
  - Both receive `exclude_ids` for dedup
- `update(problem_id, rewards, step)`:
  - Forwards to BOTH sub-scorers
  - Called for ALL groups (accepted and rejected)
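The interface above can be sketched as follows (sub-scorer method signatures are assumptions based on the description; real ones may differ):

```python
class HybridScorer:
    """Exploration via a cluster/adaptive scorer, exploitation via the
    per-problem tracker. Sketch only: sub-scorer APIs are assumed."""

    def __init__(self, explore_scorer, tracker_scorer, explore_frac=0.25):
        self.explore_scorer = explore_scorer  # ClusterScorer or AdaptiveScorer
        self.tracker = tracker_scorer         # per-problem EMA variance + UCB
        self.explore_frac = explore_frac

    def select(self, n, step, exclude_ids=()):
        n_explore = int(n * self.explore_frac)
        n_exploit = n - n_explore
        exclude = set(exclude_ids)

        # Exploration: NEW problems the cluster/ridge model predicts have variance.
        explored = self.explore_scorer.select(n_explore, step, exclude)
        exclude |= set(explored)  # dedup: exploit picks must not repeat these

        # Exploitation: replay known-good problems from per-problem stats.
        exploited = self.tracker.select(n_exploit, step, exclude)
        return list(explored) + list(exploited)

    def update(self, problem_id, rewards, step):
        # Both subsystems see every group, accepted or rejected.
        self.tracker.update(problem_id, rewards, step)
        self.explore_scorer.update(problem_id, rewards, step)
```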
Runs (4 runs, ~6 hours):
- `--scorer tracker` (baseline, same as Exp 1)
- `--scorer hybrid --hybrid_explore cluster --cluster_key type_level --priority_explore_frac 0.25`
- `--scorer hybrid --hybrid_explore ridge --priority_explore_frac 0.25`
- `--scorer hybrid --hybrid_explore ridge --priority_explore_frac 0.50` (aggressive explore)
Hypothesis: Hybrid should match or beat baseline on exploitation (same tracker for replay), while getting better exploration (fewer wasted generations on unknown problems).
Goal: Test whether adaptive/hybrid approaches improve with more data (they should — tracker scales linearly, adaptive generalizes).
Same 4 configs as Experiment 2, but with --total_optim_steps 40 --eval_every_optim_steps 10.
Hypothesis: Adaptive/hybrid should close the gap or overtake baseline at higher step counts, because the ridge model gets better predictions with more observations.
Primary: Stratified eval accuracy at each checkpoint.
Secondary (exploration efficiency):
- Acceptance rate of fresh groups (should be higher for adaptive/hybrid)
- Unique clusters sampled (should be more diverse for cluster/hybrid)
- Number of generation attempts per useful group (lower = better)
- Time-to-buffer-full (how quickly the buffer fills with good groups)
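The second and third metrics fall out of the attempt counts already quoted in the findings (a quick sanity check on the Exp-0 numbers; note these are overall rates, while the 57% figure above is late-stage only):

```python
# Attempts needed to land 320 useful (nonzero-variance) groups, from Exp 0.
useful = 320
attempts = {"ridge": 589, "tracker": 758}

for name, n in attempts.items():
    acceptance = useful / n  # share of fresh groups that passed the filter
    per_group = n / useful   # generation attempts per useful group
    print(f"{name}: acceptance={acceptance:.1%}, attempts/group={per_group:.2f}")

saved = 1 - attempts["ridge"] / attempts["tracker"]
print(f"ridge wastes {saved:.0%} fewer generations")  # ~22%
```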
Diagnostic:
- Per-cluster update counts (should be more balanced with update-on-rejection fix)
- Ridge/kNN prediction accuracy (correlation of predicted vs actual variance)
- KL divergence and grad norm stability
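The prediction-accuracy diagnostic is a single correlation per eval step (sketch; `predicted` and `actual` stand for the ridge/kNN variance predictions and the realized group reward variances, names assumed):

```python
import numpy as np

def prediction_accuracy(predicted, actual):
    """Pearson r between predicted and realized group reward variance.
    A value near the probe study's r=0.058 would mean the scorer is
    effectively learning noise from the embeddings."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    if len(predicted) < 2 or predicted.std() == 0 or actual.std() == 0:
        return float("nan")  # correlation undefined for constant inputs
    return float(np.corrcoef(predicted, actual)[0, 1])
```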
- Update-on-rejection fix (30 min) — biggest bang for buck, fixes Thompson collapse
- Stratified eval default (5 min) — fair comparison
- HybridScorer (2 hours) — the core architectural change
- explore_frac=0.25 (CLI flag, already exists)
- Run experiments (~6 hours per experiment)
Total implementation time: ~3 hours before first experiment launch.
- Experiment 1 (fix + re-run): Low risk, high value. Even if results don't change, we'll know the fixes work.
- Experiment 2 (hybrid): Medium risk. The hybrid architecture is sound in theory, but the exploration benefit may not overcome the 3.2% coverage problem in 20 steps.
- Experiment 3 (longer training): Medium risk. 40 steps doubles training time but may show the crossover point where adaptive starts winning.
Biggest risk: The fundamental problem may be that 7500 problems with 4096-dim embeddings is too sparse for any online learning to work in 20-40 steps. If ridge/kNN can't predict variance from embeddings (and our probe study showed r=0.058), then the adaptive scorer is trying to learn an unlearnable function.
Counter-argument: The probe study used FIXED ground-truth variance from historical runs. The adaptive scorer learns from ONLINE observations where the model is changing. Early-training variance patterns may be more predictable than cross-run averages. Also, the probe used the model's own hidden states — the adaptive scorer uses pre-computed task embeddings, which capture task structure rather than model uncertainty.