@aria42
Created February 26, 2026 05:27

Experiment: Answer-Conditioned Entropy as a Zero-Rollout Difficulty Estimator

Thesis

We can predict which math problems will produce high reward variance (the useful training signal for GRPO) by measuring the model's logit entropy when prompted with the ground-truth answer. If the model is uncertain about how to derive a known answer, the problem is at the frontier — neither trivially easy nor impossibly hard.

What We're Validating

  1. Does answer-conditioned entropy correlate with empirical reward variance? (The core signal question)
  2. Does plain generation entropy also correlate? (Cheaper baseline — maybe we don't need the distillation trick)
  3. Does kNN over problem embeddings predict reward variance? (The embedding-based alternative to DOTS)
  4. Which signal best predicts "useful training group" (nonzero reward variance)?

Experimental Design

Phase 1: Collect Ground-Truth Reward Variance (no GPU needed)

We already have eval DBs for multiple runs at multiple steps. But we need training reward variance per problem, not eval correctness. Two options:

Option A (fast, approximate): Use eval correctness across multiple runs as a proxy. We have runs 74-77, 96-102 — all 1.7B, same base model, different scorers. For each problem in the val set, compute the fraction correct across runs/steps. Problems with ~50% accuracy across runs are "frontier."

Option B (better, needs rollouts): Generate N rollouts per problem with the base model (step 0 checkpoint, temperature=1.0) via vLLM. Compute actual reward variance per problem from the rollout outcomes. This gives ground-truth reward variance at step 0, which is what the difficulty estimator would predict at training start.

Plan: Do Option A first (tonight, no GPU). If promising, do Option B when GPU is available.

Phase 2: Compute Difficulty Estimator Signals (needs 1 GPU)

For each problem in the dataset (~5000 train, ~1250 val), compute three signals using the base Qwen3-1.7B model (or the step-0 LoRA checkpoint):

Signal 1: Answer-Conditioned Entropy (the novel idea)

Prompt: "<problem text>\n\nThe answer is {ground_truth}. Let me explain the solution step by step."
  • Run a single forward pass through the model
  • Measure the entropy of the next-token distribution at the final prompt position (i.e., over the first token the model would generate — the start of the reasoning trace)
  • Also measure mean entropy over the first k=16 generated tokens (cheap: greedy decode 16 tokens, measure logits at each step)
  • High entropy = model doesn't know how to derive this answer = frontier problem
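The entropy measurement itself is a small, model-agnostic computation. A minimal sketch (the function name `token_entropy` is mine; the logit vector is assumed to come from the forward pass at the generation boundary):

```python
import numpy as np

def token_entropy(logits: np.ndarray) -> float:
    """Shannon entropy (in nats) of the softmax distribution over a logit vector."""
    z = logits - logits.max()                 # subtract max for numerical stability
    p = np.exp(z) / np.exp(z).sum()
    return float(-(p * np.log(np.clip(p, 1e-12, None))).sum())
```

A uniform distribution over a vocabulary of size V gives the maximum value log(V); a sharply peaked distribution gives a value near zero, so higher entropy directly reads as "the model has not committed to a derivation."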

Signal 2: Plain Generation Entropy (cheap baseline)

Prompt: "<problem text>" (the normal chat-templated prompt)
  • Same procedure: forward pass, measure entropy of first token and first k=16 tokens
  • High entropy = model is uncertain about how to start reasoning
  • This is cheaper (no answer conditioning) but conflates answer uncertainty with reasoning path uncertainty

Signal 3: kNN over Embeddings

  • Load cached embeddings from clusters/embeddings.npy
  • For each problem, find k=10 nearest neighbors among all problems with known reward variance
  • Predicted variance = distance-weighted average of neighbors' variances
  • This is what we'd use for cold-start (new problems with no rollout history)
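Distance-weighted kNN prediction is straightforward to sketch. A minimal version (function name and inverse-distance weighting are my assumptions; the plan only specifies k=10 and distance weighting):

```python
import numpy as np

def knn_predicted_variance(query: np.ndarray,
                           bank_embs: np.ndarray,
                           bank_vars: np.ndarray,
                           k: int = 10) -> float:
    """Predict reward variance as the inverse-distance-weighted average
    of the k nearest neighbors with known variance."""
    dists = np.linalg.norm(bank_embs - query, axis=1)
    idx = np.argsort(dists)[:k]               # indices of the k closest problems
    weights = 1.0 / (dists[idx] + 1e-8)       # small epsilon avoids divide-by-zero
    return float((weights * bank_vars[idx]).sum() / weights.sum())
```

Because the prediction is a convex combination of neighbor variances, it stays within the observed variance range, which keeps the cold-start estimate conservative.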

Phase 3: Correlation Analysis (no GPU)

For each signal, compute:

  1. Spearman rank correlation with ground-truth reward variance
  2. AUROC for binary classification: "will this problem produce a useful training group?" (reward variance > threshold)
  3. Calibration plot: binned signal value vs actual reward variance
  4. Breakdown by problem type and level: does the signal work uniformly or only for certain categories?
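The first two metrics can be sketched with standard scipy/sklearn calls (both are listed as dependencies below; `evaluate_signal` and the return-dict keys are my naming):

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import roc_auc_score

def evaluate_signal(signal, reward_variance, useful_threshold=0.05):
    """Spearman rank correlation with ground-truth variance, plus AUROC
    for the binary 'useful training group' classification."""
    rho, p_value = spearmanr(signal, reward_variance)
    useful = (np.asarray(reward_variance) > useful_threshold).astype(int)
    auroc = roc_auc_score(useful, signal)
    return {"spearman_rho": float(rho), "p_value": float(p_value), "auroc": float(auroc)}
```

Spearman (rather than Pearson) is the right choice here because the entropy-variance relationship is only expected to be monotone, not linear.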

Phase 4: Simulate Pre-Rollout Filtering (no GPU)

Using the best signal(s), simulate what would happen if we filtered problems before sending to vLLM:

  1. Rank all problems by predicted usefulness
  2. Select top-K problems (K = typical training budget)
  3. Compute actual reward variance of selected set vs random selection
  4. Estimate rollout savings: what % of generation budget would we save by filtering out predicted-zero-variance groups?
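Steps 1-3 of the simulation reduce to a ranking comparison. A minimal sketch (names are mine; "random selection" is summarized by the population mean, which is its expectation):

```python
import numpy as np

def simulate_filtering(predicted, actual, top_k, useful_threshold=0.05):
    """Compare top-K selection by predicted variance against random selection."""
    predicted, actual = np.asarray(predicted), np.asarray(actual)
    selected = np.argsort(-predicted)[:top_k]      # best K by predicted signal
    useful = actual > useful_threshold
    recall = useful[selected].sum() / max(useful.sum(), 1)
    return {
        "selected_mean_variance": float(actual[selected].mean()),
        "random_mean_variance": float(actual.mean()),  # expectation of a random pick
        "useful_recall": float(recall),                # fraction of useful groups kept
    }
```

The `useful_recall` field is the quantity the success criteria below care about: how many of the actual high-variance groups survive the filter.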

Implementation Plan

File: difficulty_signals.py (new)

Single script that computes all three signals and runs the correlation analysis. Outputs results as JSON + generates an HTML report.

Usage:
  # Phase 1: Collect ground truth from eval DBs (no GPU)
  python difficulty_signals.py collect-ground-truth --runs 74,75,76,77 --output research/ground_truth.json

  # Phase 2: Compute signals (needs 1 GPU)
  CUDA_VISIBLE_DEVICES=0 python difficulty_signals.py compute-signals \
    --model_id Qwen/Qwen3-1.7B \
    --checkpoint runs/74/step=0 \
    --ground_truth research/ground_truth.json \
    --embeddings clusters/embeddings.npy \
    --output research/signals.json

  # Phase 3+4: Analyze (no GPU)
  python difficulty_signals.py analyze \
    --signals research/signals.json \
    --ground_truth research/ground_truth.json \
    --output research/analysis.html

Ground Truth Collection (Phase 1)

def collect_ground_truth(run_dirs, steps=None):
    """
    For each problem in the val set, compute:
    - per-run accuracy at each step
    - cross-run accuracy (fraction of runs where correct)
    - approximate "variance" = p*(1-p) where p = cross-run accuracy

    Returns: {task_id: {accuracy: float, variance: float, n_evals: int, per_run: {...}}}
    """

This uses existing eval DBs — runs 74-77 and 96-102 give us 11 independent runs with eval at steps 5, 10, 15, 20. For step-0 variance, we can use just the step-5 eval across all runs (closest to base model behavior).
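The p*(1-p) proxy in the docstring above is the Bernoulli variance of the cross-run correctness, and is a one-liner (function name is mine):

```python
def variance_proxy(correct_flags):
    """Bernoulli variance p*(1-p), where p = fraction of runs where correct.
    `correct_flags` is a list of 0/1 correctness indicators, one per run."""
    p = sum(correct_flags) / len(correct_flags)
    return p * (1.0 - p)
```

It is maximized at 0.25 when p = 0.5 (the "frontier" case) and zero when every run agrees, which is exactly the behavior we want from a reward-variance stand-in.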

Signal Computation (Phase 2)

def compute_answer_conditioned_entropy(model, tokenizer, problems, k_tokens=16):
    """
    For each problem:
    1. Format prompt: problem + "The answer is {answer}. Let me explain..."
    2. Forward pass to get logits at the generation boundary
    3. Compute entropy of logit distribution
    4. Optionally: greedy decode k_tokens, compute mean entropy

    Returns: {problem_id: {first_token_entropy: float, mean_k_entropy: float}}
    """

def compute_generation_entropy(model, tokenizer, problems, k_tokens=16):
    """
    Same but with the normal chat-templated prompt (no answer conditioning).
    """

def compute_knn_predicted_variance(embeddings, ground_truth, k=10):
    """
    For each problem, find k nearest neighbors with known variance.
    Return distance-weighted average variance as prediction.
    """

Key implementation detail: We load the model in 4-bit (same as training) to measure entropy under the actual training distribution. Use model.generate(..., max_new_tokens=k, return_dict_in_generate=True, output_scores=True) to get per-token logits without writing a custom generation loop.

Analysis (Phase 3+4)

Generate an HTML report with:

  • Scatter plots: each signal vs ground-truth variance (one plot per signal)
  • ROC curves: predicting "useful group" (variance > 0.05)
  • Bar chart: correlation by problem type
  • Filtering simulation: accuracy of top-K selection vs random
  • Table: head-to-head comparison of all signals

What We Need

  • Hardware: 1 GPU for Phase 2 (loading Qwen3-1.7B in 4-bit, ~2GB VRAM). Phases 1, 3, 4 are CPU-only.
  • Time estimate:
    • Phase 1: ~2 min (reading SQLite DBs)
    • Phase 2: ~30-60 min (forward passes on ~6000 problems, k=16 tokens each)
    • Phase 3+4: ~1 min (numpy/scipy)
  • Dependencies: torch, transformers, peft, bitsandbytes (already installed), scipy (for correlation tests), sklearn (already installed)

Overnight Run Plan

# Step 1: Collect ground truth from eval DBs (no GPU, ~2 min)
uv run python difficulty_signals.py collect-ground-truth \
  --runs 74,75,76,77,96,97,98,99,100,101,102 \
  --output research/ground_truth.json

# Step 2: Compute all signals (1 GPU, ~30-60 min)
CUDA_VISIBLE_DEVICES=0 uv run python difficulty_signals.py compute-signals \
  --model_id Qwen/Qwen3-1.7B \
  --checkpoint runs/74/step=0 \
  --ground_truth research/ground_truth.json \
  --embeddings clusters/embeddings.npy \
  --output research/signals.json

# Step 3: Analyze and generate report (no GPU, ~1 min)
uv run python difficulty_signals.py analyze \
  --signals research/signals.json \
  --ground_truth research/ground_truth.json \
  --output research/entropy_analysis.html

echo "Done! Open research/entropy_analysis.html to view results."

Total wall time: ~1 hour. Safe to run overnight.

Success Criteria

  • Strong signal: Spearman |r| > 0.3 between answer-conditioned entropy and reward variance would be a publishable finding
  • Better than baselines: Answer-conditioned entropy should beat plain generation entropy (validating that the distillation trick adds value)
  • Practical filtering: Selecting top-50% of problems by predicted variance should capture >80% of actual high-variance problems (the useful training groups)

What Comes Next (if it works)

  1. Integrate the best signal into Generator as a pre-rollout filter
  2. Run a full training experiment: filtered GRPO vs baseline GRPO
  3. Measure: same final accuracy with fewer rollouts? (the paper result)
  4. Test at different training stages: does the signal stay predictive as the policy evolves? (requires re-computing entropy periodically)