We can predict which math problems will produce high reward variance (the useful training signal for GRPO) by measuring the model's logit entropy when prompted with the ground-truth answer. If the model is uncertain about how to derive a known answer, the problem is at the frontier — neither trivially easy nor impossibly hard.
- Does answer-conditioned entropy correlate with empirical reward variance? (The core signal question)
- Does plain generation entropy also correlate? (Cheaper baseline — maybe we don't need the distillation trick)
- Does kNN over problem embeddings predict reward variance? (The embedding-based alternative to DOTS)
- Which signal best predicts "useful training group" (nonzero reward variance)?
We already have eval DBs for multiple runs at multiple steps. But we need training reward variance per problem, not eval correctness. Two options:
Option A (fast, approximate): Use eval correctness across multiple runs as a proxy. We have runs 74-77, 96-102 — all 1.7B, same base model, different scorers. For each problem in the val set, compute the fraction correct across runs/steps. Problems with ~50% accuracy across runs are "frontier."
Option B (better, needs rollouts): Generate N rollouts per problem with the base model (step 0 checkpoint, temperature=1.0) via vLLM. Compute actual reward variance per problem from the rollout outcomes. This gives ground-truth reward variance at step 0, which is what the difficulty estimator would predict at training start.
Plan: Do Option A first (tonight, no GPU). If promising, do Option B when GPU is available.
For each problem in the dataset (~5000 train, ~1250 val), compute three signals using the base Qwen3-1.7B model (or the step-0 LoRA checkpoint):
Prompt: "<problem text>\n\nThe answer is {ground_truth}. Let me explain the solution step by step."
- Run a single forward pass through the model
- Measure the entropy of the logit distribution at the last token (the first token the model would generate — the start of the reasoning trace)
- Also measure mean entropy over the first k=16 generated tokens (cheap: greedy decode 16 tokens, measure logits at each step)
- High entropy = model doesn't know how to derive this answer = frontier problem
Prompt: "<problem text>" (the normal chat-templated prompt)
- Same procedure: forward pass, measure entropy of first token and first k=16 tokens
- High entropy = model is uncertain about how to start reasoning
- This is cheaper (no answer conditioning) but conflates answer uncertainty with reasoning path uncertainty
- Load cached embeddings from clusters/embeddings.npy
- For each problem, find k=10 nearest neighbors among all problems with known reward variance
- Predicted variance = distance-weighted average of neighbors' variances
- This is what we'd use for cold-start (new problems with no rollout history)
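The kNN prediction above is a few lines of numpy. A minimal sketch (the helper name `knn_predicted_variance` and euclidean/inverse-distance choices are assumptions; the real script may use cosine distance depending on how the embeddings were produced):

```python
import numpy as np

def knn_predicted_variance(query_emb, bank_embs, bank_vars, k=10, eps=1e-8):
    """Predict reward variance for one problem as the distance-weighted
    average of its k nearest neighbors' known variances."""
    d = np.linalg.norm(bank_embs - query_emb, axis=1)  # euclidean distance to each bank problem
    idx = np.argsort(d)[:k]                            # indices of the k nearest neighbors
    w = 1.0 / (d[idx] + eps)                           # inverse-distance weights
    return float(np.average(bank_vars[idx], weights=w))
```

For cold-start, `bank_embs`/`bank_vars` would be the problems with rollout history and the query a new, unseen problem.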
For each signal, compute:
- Spearman rank correlation with ground-truth reward variance
- AUROC for binary classification: "will this problem produce a useful training group?" (reward variance > threshold)
- Calibration plot: binned signal value vs actual reward variance
- Breakdown by problem type and level: does the signal work uniformly or only for certain categories?
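The first two metrics map directly onto scipy/sklearn calls. A sketch (function name `evaluate_signal` is hypothetical; the 0.05 threshold matches the ROC setup below):

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import roc_auc_score

def evaluate_signal(signal, variance, threshold=0.05):
    """Spearman correlation with reward variance, plus AUROC for the
    binary 'useful training group' label (variance > threshold)."""
    rho, p = spearmanr(signal, variance)
    labels = (np.asarray(variance) > threshold).astype(int)
    auroc = roc_auc_score(labels, signal)
    return {"spearman": rho, "p_value": p, "auroc": auroc}
```

Run once per signal and put the resulting dicts side by side for the head-to-head table.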
Using the best signal(s), simulate what would happen if we filtered problems before sending to vLLM:
- Rank all problems by predicted usefulness
- Select top-K problems (K = typical training budget)
- Compute actual reward variance of selected set vs random selection
- Estimate rollout savings: what % of generation budget would we save by filtering out predicted-zero-variance groups?
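The top-K-vs-random comparison can be sketched as a capture-rate calculation (helper name `filtering_capture` is an assumption; for random selection of a `top_frac` fraction, the expected capture rate is simply `top_frac`):

```python
import numpy as np

def filtering_capture(predicted, actual, top_frac=0.5, threshold=0.05):
    """Fraction of truly useful problems (actual variance > threshold)
    captured by selecting the top `top_frac` by predicted variance."""
    n = len(predicted)
    k = int(n * top_frac)
    order = np.argsort(predicted)[::-1][:k]        # indices of top-K by prediction
    useful = np.asarray(actual) > threshold         # ground-truth useful groups
    captured = useful[order].sum() / max(useful.sum(), 1)
    return captured, top_frac                       # observed vs random-baseline capture
```

The rollout-savings estimate is then the fraction of filtered-out problems whose actual variance is zero.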
Single script that computes all three signals and runs the correlation analysis. Outputs results as JSON + generates an HTML report.
Usage:
# Phase 1: Collect ground truth from eval DBs (no GPU)
python difficulty_signals.py collect-ground-truth --runs 74,75,76,77 --output research/ground_truth.json
# Phase 2: Compute signals (needs 1 GPU)
CUDA_VISIBLE_DEVICES=0 python difficulty_signals.py compute-signals \
--model_id Qwen/Qwen3-1.7B \
--checkpoint runs/74/step=0 \
--ground_truth research/ground_truth.json \
--embeddings clusters/embeddings.npy \
--output research/signals.json
# Phase 3+4: Analyze (no GPU)
python difficulty_signals.py analyze \
--signals research/signals.json \
--ground_truth research/ground_truth.json \
--output research/analysis.html
def collect_ground_truth(run_dirs, steps=None):
"""
For each problem in the val set, compute:
- per-run accuracy at each step
- cross-run accuracy (fraction of runs where correct)
- approximate "variance" = p*(1-p) where p = cross-run accuracy
Returns: {task_id: {accuracy: float, variance: float, n_evals: int, per_run: {...}}}
"""This uses existing eval DBs — runs 74-77 and 96-102 give us 9 independent runs with eval at steps 5,10,15,20. For step-0 variance, we can use just the step-5 eval across all runs (closest to base model behavior).
def compute_answer_conditioned_entropy(model, tokenizer, problems, k_tokens=16):
"""
For each problem:
1. Format prompt: problem + "The answer is {answer}. Let me explain..."
2. Forward pass to get logits at the generation boundary
3. Compute entropy of logit distribution
4. Optionally: greedy decode k_tokens, compute mean entropy
Returns: {problem_id: {first_token_entropy: float, mean_k_entropy: float}}
"""
def compute_generation_entropy(model, tokenizer, problems, k_tokens=16):
"""
Same but with the normal chat-templated prompt (no answer conditioning).
"""
def compute_knn_predicted_variance(embeddings, ground_truth, k=10):
"""
For each problem, find k nearest neighbors with known variance.
Return distance-weighted average variance as prediction.
"""Key implementation detail: We load the model in 4-bit (same as training) to
measure entropy under the actual training distribution. Use
model.generate(..., max_new_tokens=k, return_dict_in_generate=True, output_scores=True)
to get per-token logits without writing a custom generation loop.
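Given the `scores` tuple that generate returns (one [batch, vocab] logit tensor per generated token), the mean k-token entropy is a small reduction. A sketch (helper name `mean_entropy_from_scores` is an assumption):

```python
import torch
import torch.nn.functional as F

def mean_entropy_from_scores(scores):
    """scores: tuple of [batch, vocab] logit tensors, one per generated
    token (as in generate(..., output_scores=True)). Returns the mean
    per-token entropy in nats, shape [batch]."""
    per_step = []
    for step_logits in scores:
        logp = F.log_softmax(step_logits, dim=-1)
        per_step.append(-(logp.exp() * logp).sum(dim=-1))  # [batch] entropy at this step
    return torch.stack(per_step, dim=0).mean(dim=0)
```

One caveat to check: some sampling configs apply logits processors before `scores` is recorded, so greedy decoding with default processors keeps the entropies closest to the raw model distribution.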
Generate an HTML report with:
- Scatter plots: each signal vs ground-truth variance (one plot per signal)
- ROC curves: predicting "useful group" (variance > 0.05)
- Bar chart: correlation by problem type
- Filtering simulation: accuracy of top-K selection vs random
- Table: head-to-head comparison of all signals
- Hardware: 1 GPU for Phase 2 (loading Qwen3-1.7B in 4-bit, ~2GB VRAM). Phases 1, 3, 4 are CPU-only.
- Time estimate:
- Phase 1: ~2 min (reading SQLite DBs)
- Phase 2: ~30-60 min (forward passes on ~6000 problems, k=16 tokens each)
- Phase 3+4: ~1 min (numpy/scipy)
- Dependencies: torch, transformers, peft, bitsandbytes (already installed), scipy (for correlation tests), sklearn (already installed)
# Step 1: Collect ground truth from eval DBs (no GPU, ~2 min)
uv run python difficulty_signals.py collect-ground-truth \
--runs 74,75,76,77,96,97,98,99,100,101,102 \
--output research/ground_truth.json
# Step 2: Compute all signals (1 GPU, ~30-60 min)
CUDA_VISIBLE_DEVICES=0 uv run python difficulty_signals.py compute-signals \
--model_id Qwen/Qwen3-1.7B \
--checkpoint runs/74/step=0 \
--ground_truth research/ground_truth.json \
--embeddings clusters/embeddings.npy \
--output research/signals.json
# Step 3: Analyze and generate report (no GPU, ~1 min)
uv run python difficulty_signals.py analyze \
--signals research/signals.json \
--ground_truth research/ground_truth.json \
--output research/entropy_analysis.html
echo "Done! Open research/entropy_analysis.html to view results."Total wall time: ~1 hour. Safe to run overnight.
- Strong signal: Spearman |r| > 0.3 between answer-conditioned entropy and reward variance would be a publishable finding
- Better than baselines: Answer-conditioned entropy should beat plain generation entropy (validating that the distillation trick adds value)
- Practical filtering: Selecting top-50% of problems by predicted variance should capture >80% of actual high-variance problems (the useful training groups)
- Integrate the best signal into Generator as a pre-rollout filter
- Run a full training experiment: filtered GRPO vs baseline GRPO
- Measure: same final accuracy with fewer rollouts? (the paper result)
- Test at different training stages: does the signal stay predictive as the policy evolves? (requires re-computing entropy periodically)