@aria42
Created February 27, 2026 00:20
Adaptive Problem Selection v2: Experiment Proposal

What We Learned from v1

The good news

  • Ridge acceptance rate: 57% vs baseline 42% (late stage). The adaptive scorer is genuinely better at finding useful problems: 320 useful groups took 589 generation attempts vs the baseline's 758, i.e. 22% fewer wasted generations.
  • The per-level breakdown shows baseline wins at every level, but the adaptive scorer was handicapped by a KL explosion at step 19 (kl=10.2, grad_norm=4.66).

The bad news

  • Eval accuracy: baseline 82.2% > ridge 78.7% > cluster 77.6%. Better exploration didn't translate to better training.
  • kNN mode crashed due to argpartition bug (now fixed).
  • Cluster Thompson sampling collapsed to 2/14 clusters (Algebra/easy and Algebra/hard). Only 47 out of 320 trained groups got cluster updates logged. The prior for unseen clusters (random*0.25) can't compete with observed variance once any cluster gets a few samples.

Root causes identified

  1. The scorer replaces the tracker entirely. The adaptive scorer controls BOTH exploration (finding new problems) AND exploitation (replaying known-good problems). It should only control exploration — exploitation should use per-problem observations.

  2. Thompson sampling collapses with rejection sampling. Rejection rejects zero-variance groups, which means clusters that produce easy/impossible problems never get update() calls, so Thompson never learns about them. Only clusters that produce frontier problems accumulate stats.

  3. Cold start dominates. With 20 steps × 16 groups, the ridge scorer observes only 238/7500 problems (3.2%). Fitting 4096-dim ridge on 238 points is underdetermined. The benefit of better exploration never overcomes the early-step waste.

  4. Eval wasn't stratified. The regular eval overweights easy problems where all scorers perform similarly. The cluster/adaptive approaches should shine on harder problems that require better training diversity.

  5. explore_frac=10% is too low for 14 clusters. With 16 groups/step, that's ~1.6 explore groups — less than 1 per cluster per 9 steps. Clusters barely get sampled during exploration.
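
To make the collapse in root cause 2 concrete, here is a toy sketch (all names hypothetical, not the actual scorer code) of Gaussian Thompson sampling with the `random() * 0.25` prior for unseen clusters:

```python
import random

class GaussianThompson:
    """Toy Thompson sampler over clusters. Unseen clusters fall back to a
    weak random prior (random() * 0.25), mirroring the collapse described
    above. Illustrative only."""

    def __init__(self, clusters):
        self.stats = {c: [] for c in clusters}  # observed group variances

    def sample_score(self, cluster):
        obs = self.stats[cluster]
        if not obs:
            # Prior for unseen clusters is capped at 0.25, so any cluster
            # whose observed mean sits above 0.25 dominates almost forever.
            return random.random() * 0.25
        mean = sum(obs) / len(obs)
        # Crude posterior draw: mean plus noise that shrinks with samples
        return random.gauss(mean, 1.0 / (1 + len(obs)))

    def select(self):
        return max(self.stats, key=self.sample_score)

    def update(self, cluster, variance):
        self.stats[cluster].append(variance)
```

Once one cluster accumulates a few observations with mean variance above 0.25, the unseen clusters' capped uniform prior almost never wins the draw, which is exactly the 2/14 collapse observed in v1.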

Proposed Architecture: Hybrid Scorer

The key insight, from user feedback on v1: the adaptive scorer should control only exploration. Exploitation (replay) should use per-problem stats.

                         ┌──────────────────────┐
                         │     HybridScorer     │
                         │                      │
                         │    select(n, step)   │
                         │          │           │
                         │     ┌────┴────┐      │
                         │     ▼         ▼      │
                         │  explore   exploit   │
                         │  (cluster  (tracker  │
                         │   or        per-     │
                         │   ridge)    problem) │
                         └──────────────────────┘

The HybridScorer:

  • Exploration (explore_frac of groups): Use cluster-level Thompson or ridge regression to pick NEW problems likely to have variance
  • Exploitation (rest): Use per-problem tracker (EMA reward variance + UCB) to repick known-good problems
  • update() feeds BOTH subsystems — tracker gets per-problem stats, cluster/ridge gets aggregate data
  • Critically, update() should be called for ALL generated groups, not just accepted ones — zero-variance feedback is signal too
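
As a sketch of what the exploitation side might look like (names and constants are illustrative; the real tracker's details may differ), a per-problem EMA-of-reward-variance tracker with a UCB bonus:

```python
import math

class ProblemTracker:
    """Per-problem EMA of reward variance plus a UCB exploration bonus.
    Hypothetical sketch; field names and constants are illustrative."""

    def __init__(self, alpha=0.3, ucb_c=1.0):
        self.alpha = alpha      # EMA smoothing factor
        self.ucb_c = ucb_c      # UCB exploration coefficient
        self.ema_var = {}       # problem_id -> EMA of group reward variance
        self.counts = {}        # problem_id -> number of updates
        self.total = 0

    def update(self, problem_id, rewards):
        mean = sum(rewards) / len(rewards)
        var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
        prev = self.ema_var.get(problem_id, var)
        self.ema_var[problem_id] = (1 - self.alpha) * prev + self.alpha * var
        self.counts[problem_id] = self.counts.get(problem_id, 0) + 1
        self.total += 1

    def score(self, problem_id):
        n = self.counts.get(problem_id, 0)
        if n == 0:
            return float("inf")  # unseen problems sort first
        bonus = self.ucb_c * math.sqrt(math.log(self.total) / n)
        return self.ema_var[problem_id] + bonus

    def select(self, n, step=0, exclude_ids=()):
        # step is accepted for interface parity but unused here
        candidates = [p for p in self.counts if p not in exclude_ids]
        return sorted(candidates, key=self.score, reverse=True)[:n]
```

High EMA variance means the problem is near the model's frontier; the UCB term keeps rarely-replayed problems from being starved.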

Key change: update on rejection too

Currently the generator only calls scorer.update() for groups that pass the variance filter. Zero-variance groups (all-correct or all-wrong) get silently discarded. This means:

  • The tracker doesn't learn "this problem is too easy/hard"
  • Thompson sampling never learns about low-variance clusters
  • Ridge regression misses ~50% of its training data

Fix: Call scorer.update() for ALL generated groups, even rejected ones. The scorer should see the full picture.
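
A minimal sketch of the fix, assuming a generation loop roughly like the one below (function and parameter names are hypothetical, not the actual generator code):

```python
def generate_groups(scorer, generator, n_groups, step, min_variance=1e-6):
    """Generate training groups, keeping only those with reward variance,
    but feeding ALL outcomes back to the scorer (hypothetical interfaces)."""
    accepted = []
    while len(accepted) < n_groups:
        problem_id = scorer.select(1, step=step)[0]
        rewards = generator.rollout(problem_id)  # one group of rollouts
        mean = sum(rewards) / len(rewards)
        variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
        # The fix: update on every group, accepted or rejected, so the
        # scorer also learns which problems are too easy or too hard.
        scorer.update(problem_id, rewards, step=step)
        if variance > min_variance:  # variance filter (rejection sampling)
            accepted.append((problem_id, rewards))
    return accepted
```

The only change from the current behavior is moving `scorer.update()` above the variance filter instead of inside the accept branch.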

Experiment Plan

Experiment 1: Bug Fixes + Stratified Eval (baseline re-run)

Goal: Establish fair baselines with stratified eval and the update-on-rejection fix.

Changes:

  • --stratified_eval on all runs
  • --priority_explore_frac 0.25 (up from 0.10)
  • Fix: call scorer.update() for rejected groups too
  • Fix: kNN argpartition bug (already fixed)

Runs (4 runs, ~6 hours):

  1. --scorer tracker (baseline)
  2. --scorer cluster --cluster_key type_level (cluster, should benefit from update fix)
  3. --scorer adaptive --adaptive_mode ridge (ridge)
  4. --scorer adaptive --adaptive_mode knn (kNN, now with bug fix)

Hypothesis: Cluster should improve significantly with update-on-rejection fix, because Thompson sampling will properly learn about all clusters instead of collapsing.

Experiment 2: Hybrid Scorer

Goal: Test the hybrid architecture where exploration uses cluster/ridge and exploitation uses per-problem tracker.

Implementation:

  • New HybridScorer class that wraps both a TrackerScorer and a ClusterScorer (or AdaptiveScorer)
  • select(n, step, exclude_ids):
    • n_explore = int(n * explore_frac) → delegated to cluster/adaptive scorer
    • n_exploit = n - n_explore → delegated to tracker scorer
    • Both receive exclude_ids for dedup
  • update(problem_id, rewards, step):
    • Forwards to BOTH sub-scorers
    • Called for ALL groups (accepted and rejected)
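
Putting the spec above together, a sketch of the class (sub-scorer interfaces are assumed for illustration, not taken from the actual codebase):

```python
class HybridScorer:
    """Exploration via a cluster/adaptive scorer, exploitation via the
    per-problem tracker. Sub-scorer interfaces are assumptions."""

    def __init__(self, tracker, explorer, explore_frac=0.25):
        self.tracker = tracker      # per-problem exploitation scorer
        self.explorer = explorer    # cluster or ridge exploration scorer
        self.explore_frac = explore_frac

    def select(self, n, step, exclude_ids=frozenset()):
        n_explore = int(n * self.explore_frac)
        n_exploit = n - n_explore
        chosen = self.explorer.select(n_explore, step, exclude_ids)
        # Dedup: exploit must not repick what explore already chose.
        seen = set(exclude_ids) | set(chosen)
        chosen += self.tracker.select(n_exploit, step, seen)
        return chosen

    def update(self, problem_id, rewards, step):
        # Both subsystems see every group, accepted or rejected.
        self.tracker.update(problem_id, rewards, step)
        self.explorer.update(problem_id, rewards, step)
```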

Runs (4 runs, ~6 hours):

  1. --scorer tracker (baseline, same as Exp 1)
  2. --scorer hybrid --hybrid_explore cluster --cluster_key type_level --priority_explore_frac 0.25
  3. --scorer hybrid --hybrid_explore ridge --priority_explore_frac 0.25
  4. --scorer hybrid --hybrid_explore ridge --priority_explore_frac 0.50 (aggressive explore)

Hypothesis: Hybrid should match or beat baseline on exploitation (same tracker for replay), while getting better exploration (fewer wasted generations on unknown problems).

Experiment 3: Longer Training (40 steps)

Goal: Test whether adaptive/hybrid approaches improve with more data (they should — tracker scales linearly, adaptive generalizes).

Same 4 configs as Experiment 2, but with --total_optim_steps 40 --eval_every_optim_steps 10.

Hypothesis: Adaptive/hybrid should close the gap or overtake baseline at higher step counts, because the ridge model gets better predictions with more observations.

Metrics to Track

Primary: Stratified eval accuracy at each checkpoint.

Secondary (exploration efficiency):

  • Acceptance rate of fresh groups (should be higher for adaptive/hybrid)
  • Unique clusters sampled (should be more diverse for cluster/hybrid)
  • Number of generation attempts per useful group (lower = better)
  • Time-to-buffer-full (how quickly the buffer fills with good groups)

Diagnostic:

  • Per-cluster update counts (should be more balanced with update-on-rejection fix)
  • Ridge/kNN prediction accuracy (correlation of predicted vs actual variance)
  • KL divergence and grad norm stability
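
For the prediction-accuracy diagnostic, a small self-contained helper (a sketch; where the predicted and realized variances get logged from is assumed):

```python
import math

def pearson_r(predicted, actual):
    """Pearson correlation between predicted and realized group reward
    variance. Values near 0 (like the probe study's r=0.058) mean the
    adaptive scorer's predictions carry little signal."""
    n = len(predicted)
    mp = sum(predicted) / n
    ma = sum(actual) / n
    cov = sum((p - mp) * (a - ma) for p, a in zip(predicted, actual))
    sp = math.sqrt(sum((p - mp) ** 2 for p in predicted))
    sa = math.sqrt(sum((a - ma) ** 2 for a in actual))
    return cov / (sp * sa) if sp and sa else 0.0
```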

Implementation Priority

  1. Update-on-rejection fix (30 min) — biggest bang for buck, fixes Thompson collapse
  2. Stratified eval default (5 min) — fair comparison
  3. HybridScorer (2 hours) — the core architectural change
  4. explore_frac=0.25 (CLI flag, already exists)
  5. Run experiments (~6 hours per experiment)

Total implementation time: ~3 hours before first experiment launch.

Risk Assessment

  • Experiment 1 (fix + re-run): Low risk, high value. Even if results don't change, we'll know the fixes work.
  • Experiment 2 (hybrid): Medium risk. The hybrid architecture is sound in theory, but the exploration benefit may not overcome the 3.2% coverage problem in 20 steps.
  • Experiment 3 (longer training): Medium risk. 40 steps doubles training time but may show the crossover point where adaptive starts winning.

Biggest risk: The fundamental problem may be that 7500 problems with 4096-dim embeddings is too sparse for any online learning to work in 20-40 steps. If ridge/kNN can't predict variance from embeddings (and our probe study showed r=0.058), then the adaptive scorer is trying to learn an unlearnable function.

Counter-argument: The probe study used FIXED ground-truth variance from historical runs. The adaptive scorer learns from ONLINE observations where the model is changing. Early-training variance patterns may be more predictable than cross-run averages. Also, the probe used the model's own hidden states — the adaptive scorer uses pre-computed task embeddings, which capture task structure rather than model uncertainty.
