@aria42
Created February 25, 2026 23:32
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Literature Review: Curriculum Learning &amp; Proposal Distributions for GRPO</title>
<style>
:root {
--bg: #0d1117;
--surface: #161b22;
--border: #30363d;
--text: #e6edf3;
--text-muted: #8b949e;
--accent: #58a6ff;
--accent2: #3fb950;
--accent3: #d2a8ff;
--accent4: #f0883e;
--accent5: #f85149;
}
* { margin: 0; padding: 0; box-sizing: border-box; }
body {
font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Helvetica, Arial, sans-serif;
background: var(--bg);
color: var(--text);
line-height: 1.6;
padding: 2rem;
max-width: 1100px;
margin: 0 auto;
}
h1 {
font-size: 1.8rem;
margin-bottom: 0.5rem;
color: var(--accent);
}
.subtitle {
color: var(--text-muted);
margin-bottom: 2rem;
font-size: 0.95rem;
}
h2 {
font-size: 1.3rem;
margin: 2rem 0 1rem;
padding-bottom: 0.4rem;
border-bottom: 1px solid var(--border);
color: var(--accent3);
}
h3 {
font-size: 1.05rem;
margin: 1.2rem 0 0.4rem;
color: var(--accent2);
}
p, li { color: var(--text); font-size: 0.92rem; }
p { margin-bottom: 0.6rem; }
a { color: var(--accent); text-decoration: none; }
a:hover { text-decoration: underline; }
.callout {
background: var(--surface);
border-left: 3px solid var(--accent);
padding: 0.8rem 1rem;
margin: 1rem 0;
border-radius: 0 6px 6px 0;
}
.callout.warn { border-left-color: var(--accent4); }
.callout.good { border-left-color: var(--accent2); }
.callout strong { color: var(--accent); }
.callout.warn strong { color: var(--accent4); }
.callout.good strong { color: var(--accent2); }
.paper {
background: var(--surface);
border: 1px solid var(--border);
border-radius: 8px;
padding: 1rem 1.2rem;
margin: 0.8rem 0;
}
.paper-title {
font-weight: 600;
color: var(--accent2);
font-size: 1rem;
}
.paper-link {
font-size: 0.82rem;
color: var(--text-muted);
margin-bottom: 0.5rem;
}
.paper-tags {
display: flex;
gap: 0.4rem;
flex-wrap: wrap;
margin: 0.5rem 0;
}
.tag {
font-size: 0.72rem;
padding: 2px 8px;
border-radius: 12px;
font-weight: 500;
}
.tag.difficulty { background: #1f3a2a; color: #3fb950; }
.tag.gradient { background: #2a1f3a; color: #d2a8ff; }
.tag.content { background: #3a2a1f; color: #f0883e; }
.tag.venue { background: #1f2a3a; color: #58a6ff; }
.tag.result { background: #3a1f1f; color: #f85149; }
.detail { color: var(--text-muted); font-size: 0.88rem; }
.key-point { margin: 0.3rem 0 0.3rem 1rem; font-size: 0.88rem; }
.key-point::before { content: "→ "; color: var(--accent4); font-weight: bold; }
table {
width: 100%;
border-collapse: collapse;
margin: 1rem 0;
font-size: 0.85rem;
}
th, td {
padding: 0.5rem 0.7rem;
text-align: left;
border: 1px solid var(--border);
}
th { background: var(--surface); color: var(--accent3); font-weight: 600; }
td { color: var(--text); }
tr:nth-child(even) td { background: rgba(22, 27, 34, 0.5); }
.action-item {
background: var(--surface);
border: 1px solid var(--border);
border-radius: 8px;
padding: 0.8rem 1rem;
margin: 0.6rem 0;
display: flex;
gap: 0.8rem;
align-items: flex-start;
}
.action-num {
background: var(--accent);
color: var(--bg);
font-weight: 700;
width: 28px;
height: 28px;
border-radius: 50%;
display: flex;
align-items: center;
justify-content: center;
flex-shrink: 0;
font-size: 0.85rem;
}
.action-body { flex: 1; }
.action-body strong { color: var(--accent); }
.action-body .source { font-size: 0.8rem; color: var(--text-muted); }
</style>
</head>
<body>
<h1>Curriculum Learning &amp; Proposal Distributions for GRPO</h1>
<p class="subtitle">Literature review &mdash; Feb 2026. Connections to our cluster/priority sampling experiments noted throughout.</p>
<div class="callout">
<strong>Core problem:</strong> Under uniform sampling, GRPO wastes most compute. Easy problems &rarr; all-correct groups (zero gradient). Hard problems &rarr; all-wrong groups (zero gradient). As training progresses, informative prompts shrink to ~10% of the pool.
</div>
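<p class="detail">As a concrete illustration (not taken from any of the papers below), a minimal sketch of GRPO's group-normalized advantage shows why uniform-outcome groups contribute nothing to the gradient:</p>

```python
def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: normalize rewards within one rollout group."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    # If every rollout got the same reward (all-correct or all-wrong),
    # every advantage is 0 and the prompt contributes no policy gradient.
    return [(r - mean) / (std + eps) for r in rewards]

print(group_advantages([1, 1, 1, 1]))  # all-correct -> all zeros
print(group_advantages([1, 0, 1, 0]))  # mixed outcomes -> nonzero signal
```

<p class="detail">Every curriculum below is, one way or another, a strategy for keeping sampled groups in the mixed-outcome regime.</p>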
<!-- ───────────────────────── SECTION 2 ───────────────────────── -->
<h2>Difficulty-Only Approaches</h2>
<p class="detail">Closest to our <code>DifficultyTracker</code> (EMA reward variance + UCB scoring).</p>
<div class="paper">
<div class="paper-title">VCRL: Variance-based Curriculum Reinforcement Learning</div>
<div class="paper-link"><a href="https://arxiv.org/abs/2509.19803">arxiv.org/abs/2509.19803</a></div>
<div class="paper-tags">
<span class="tag difficulty">difficulty</span>
<span class="tag venue">2025</span>
</div>
<p>Uses reward variance directly as the curriculum signal: high-variance samples (those of moderate difficulty) are most informative. Dynamically shifts sampling toward the "frontier."</p>
<p class="key-point">Validates that reward variance (what our <code>DifficultyTracker</code> uses) is the right signal.</p>
<p class="key-point">Beats GRPO/DAPO/GSPO on 5 math benchmarks.</p>
</div>
<div class="paper">
<div class="paper-title">DOTS: Difficulty-targeted Online Data Selection &amp; Rollout Replay</div>
<div class="paper-link"><a href="https://arxiv.org/abs/2506.05316">arxiv.org/abs/2506.05316</a></div>
<div class="paper-tags">
<span class="tag difficulty">difficulty</span>
<span class="tag venue">NeurIPS 2025</span>
<span class="tag result">23-62% speedup</span>
</div>
<p>Targets ~0.5 pass rate (information-theoretic sweet spot). Estimates difficulty via attention-based similarity on a small reference set &mdash; avoids rolling out every candidate. Bounded FIFO replay buffer.</p>
<p class="key-point">Their replay buffer is very similar to our <code>BufferStore</code> (age-based eviction). Their difficulty estimation without rollouts could save vLLM compute vs our roll-out-then-reject approach.</p>
</div>
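<p class="detail">Their bounded FIFO buffer can be sketched with a <code>deque</code>; the class and field names here are illustrative, not DOTS's implementation:</p>

```python
import random
from collections import deque

class RolloutReplayBuffer:
    """Bounded FIFO replay buffer: oldest rollouts are evicted first (sketch)."""
    def __init__(self, capacity):
        self._buf = deque(maxlen=capacity)  # deque drops the oldest item on overflow

    def add(self, rollout):
        self._buf.append(rollout)

    def sample(self, k):
        # Sample without replacement from whatever is currently retained.
        return random.sample(list(self._buf), min(k, len(self._buf)))

buf = RolloutReplayBuffer(capacity=3)
for step in range(5):
    buf.add({"prompt_id": step})
print([r["prompt_id"] for r in buf._buf])  # [2, 3, 4] -- oldest two evicted
```

<p class="detail">Our <code>BufferStore</code> differs only in the eviction rule (age + replay count instead of pure FIFO).</p>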
<div class="paper">
<div class="paper-title">Hard Examples Are All You Need</div>
<div class="paper-link"><a href="https://arxiv.org/abs/2508.14094">arxiv.org/abs/2508.14094</a></div>
<div class="paper-tags">
<span class="tag difficulty">difficulty</span>
<span class="tag result">up to +47%</span>
</div>
<p>Contrarian take: training on the <strong>hardest 10%</strong> (by base model pass rate) yields up to 47% gains. Hard examples maintain mixed outcomes throughout training; easy ones plateau.</p>
<p class="key-point">Interesting tension with our finding that <code>cluster_level</code> (difficulty-only clustering with 5 buckets) consistently hurt. Possibly too coarse, or EMA difficulty &ne; base-model difficulty.</p>
</div>
<div class="paper">
<div class="paper-title">DEPO: Difficulty-Estimated Policy Optimization</div>
<div class="paper-link"><a href="https://arxiv.org/abs/2602.06375">arxiv.org/abs/2602.06375</a></div>
<div class="paper-tags">
<span class="tag difficulty">difficulty</span>
<span class="tag result">2x cost reduction</span>
</div>
<p>Lightweight online difficulty estimator filters <strong>before rollout</strong>. Avoids wasting vLLM compute on zero-variance groups.</p>
<p class="key-point">Our rejection sampling filters <em>after</em> generation. DEPO's pre-rollout filtering could reduce wasted <code>max_gen_rounds</code>.</p>
</div>
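<p class="detail">A hedged sketch of the idea (not DEPO's actual estimator): with binary rewards and an estimated pass rate, the probability that a group comes back uniform has a closed form, so a prompt can be skipped before any generation happens. The threshold is illustrative.</p>

```python
def should_rollout(p_hat, group_size=8, min_informative_prob=0.3):
    """Pre-rollout filter (DEPO-style sketch; constants are illustrative).

    With binary rewards and estimated pass rate p_hat, the chance that a
    group of `group_size` rollouts is all-correct or all-wrong (zero
    advantage, zero gradient) is p^n + (1-p)^n. Skip prompts whose group
    is unlikely to come back mixed.
    """
    n = group_size
    p_uniform = p_hat ** n + (1.0 - p_hat) ** n
    return (1.0 - p_uniform) >= min_informative_prob

print(should_rollout(0.5))   # True: frontier prompt, group almost surely mixed
print(should_rollout(0.98))  # False: likely all-correct, skip before vLLM
```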
<!-- ───────────────────────── SECTION 3 ───────────────────────── -->
<h2>Gradient-Aware Approaches</h2>
<p class="detail">Goes deeper than reward variance &mdash; analyzes the actual gradient signal.</p>
<div class="paper">
<div class="paper-title">CurES: From Gradient Analysis to Efficient Curriculum Learning</div>
<div class="paper-link"><a href="https://arxiv.org/abs/2510.01037">arxiv.org/abs/2510.01037</a></div>
<div class="paper-tags">
<span class="tag gradient">gradient-aware</span>
<span class="tag result">+3.3pt (1.5B), +4.8pt (7B)</span>
</div>
<p>Most theoretically grounded. Gradient analysis reveals <strong>two independent levers</strong>:</p>
<ol style="margin: 0.3rem 0 0.3rem 1.5rem; font-size: 0.88rem;">
<li><strong>Prompt sampling distribution</strong> dictates convergence rate (which problems)</li>
<li><strong>Rollout quantity allocation</strong> per prompt affects gradient stability (how many samples)</li>
</ol>
<p>Uses Bayesian posterior estimation to set both.</p>
<div class="callout warn">
<strong>Key insight for us:</strong> We use fixed <code>group_size=8</code> for all prompts. CurES says allocate more rollouts to frontier problems, fewer to easy/impossible ones. This is orthogonal to cluster work.
</div>
</div>
<div class="paper">
<div class="paper-title">Reinforce-Ada: Adaptive Sampling Framework</div>
<div class="paper-link"><a href="https://arxiv.org/abs/2510.04996">arxiv.org/abs/2510.04996</a></div>
<div class="paper-tags">
<span class="tag gradient">gradient-aware</span>
<span class="tag result">2x convergence speedup</span>
</div>
<p>Optimal per-prompt group size is proportional to difficulty. Signal loss at high accuracy stems from <strong>undersampling</strong>, not a fundamental limit; larger groups for hard prompts recover the learning signal.</p>
<p class="key-point">Same total inference budget, 2x convergence. Variable group sizes could be a big win.</p>
</div>
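<p class="detail">One simple way to realize variable group sizes under a fixed budget &mdash; an illustrative heuristic, not the papers' Bayesian allocation &mdash; is to split rollouts in proportion to estimated Bernoulli reward variance p(1&minus;p):</p>

```python
def allocate_rollouts(pass_rates, total_budget, min_n=2):
    """Split a fixed rollout budget across prompts by estimated difficulty.

    Illustrative sketch (not the CurES/Reinforce-Ada estimator): weight each
    prompt by its reward variance p*(1-p), so frontier prompts (p ~ 0.5) get
    more rollouts while easy/impossible ones get the minimum.
    """
    weights = [max(p * (1 - p), 1e-6) for p in pass_rates]
    spare = total_budget - min_n * len(pass_rates)
    total_w = sum(weights)
    alloc = [min_n + int(spare * w / total_w) for w in weights]
    # Hand out rounding leftovers to the highest-variance prompts.
    leftover = total_budget - sum(alloc)
    for i in sorted(range(len(weights)), key=lambda i: -weights[i])[:leftover]:
        alloc[i] += 1
    return alloc

# 4 prompts under the same budget as fixed group_size=8 (32 rollouts total):
print(allocate_rollouts([0.05, 0.5, 0.6, 0.95], total_budget=32))
```

<p class="detail">The total vLLM budget is unchanged; only its distribution across prompts moves.</p>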
<div class="paper">
<div class="paper-title">DaGRPO: Distinctiveness-Aware GRPO</div>
<div class="paper-link"><a href="https://arxiv.org/abs/2512.06337">arxiv.org/abs/2512.06337</a></div>
<div class="paper-tags">
<span class="tag gradient">gradient-aware</span>
<span class="tag result">+4.7% math</span>
</div>
<p>Even within a group, homogeneous outputs create <strong>gradient conflicts</strong>. Masks sample pairs with low "distinctiveness" and augments hard problems with off-policy positives.</p>
<p class="key-point">At 84% accuracy (our 8B run), most groups are all-correct with near-identical outputs. DaGRPO masking could help.</p>
</div>
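<p class="detail">A cheap approximation of the masking step (token-level Jaccard similarity stands in for the paper's distinctiveness score; the threshold is illustrative):</p>

```python
def distinctiveness_mask(outputs, threshold=0.9):
    """Mask near-duplicate rollouts within a group (DaGRPO-inspired sketch).

    Returns a keep-mask: the first member of each near-duplicate cluster is
    kept, later copies are masked out of the loss.
    """
    token_sets = [set(o.split()) for o in outputs]
    keep = [True] * len(outputs)
    for i in range(len(outputs)):
        if not keep[i]:
            continue
        for j in range(i + 1, len(outputs)):
            if not keep[j]:
                continue
            inter = len(token_sets[i] & token_sets[j])
            union = len(token_sets[i] | token_sets[j]) or 1
            if inter / union >= threshold:
                keep[j] = False  # redundant pair: drop the later copy
    return keep

outs = ["the answer is 42", "the answer is 42", "it equals 7"]
print(distinctiveness_mask(outs))  # [True, False, True]
```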
<!-- ───────────────────────── SECTION 4 ───────────────────────── -->
<h2>Content-Aware &amp; Domain-Weighting Approaches</h2>
<p class="detail">Closest to our <code>ClusterScorer</code> / Thompson sampling on clusters.</p>
<div class="paper">
<div class="paper-title">GRPO-LEAD: Difficulty-Aware RL + Length Regularization</div>
<div class="paper-link"><a href="https://arxiv.org/abs/2504.09696">arxiv.org/abs/2504.09696</a></div>
<div class="paper-tags">
<span class="tag content">content-aware</span>
<span class="tag venue">EMNLP 2025</span>
</div>
<p>Upweights advantages on harder problems in the <strong>loss computation</strong> (not sampling stage). Length-regularized rewards for conciseness.</p>
<p class="key-point">Our reward shaping (<code>think_close_penalty</code>, <code>length_penalty</code>) is similar to their length regularization. Their advantage reweighting is complementary &mdash; modify the loss, not the proposal distribution.</p>
</div>
<div class="paper">
<div class="paper-title">DoReMi: Optimizing Data Mixtures via Minimax</div>
<div class="paper-link"><a href="https://arxiv.org/abs/2305.10429">arxiv.org/abs/2305.10429</a></div>
<div class="paper-tags">
<span class="tag content">domain-weighting</span>
<span class="tag venue">NeurIPS 2023</span>
</div>
<p>Pretraining analog of our cluster Thompson sampling. Group DRO across domains: <strong>minimax over worst-case excess loss</strong>. Upweights domains where the model has the largest gap between current and reference model.</p>
<div class="callout warn">
<strong>Key insight for us:</strong> Might fix <code>cluster_level</code> underperformance. Thompson sampling on 5 buckets can't converge in 20 steps, but minimax would directly target the weakest cluster without needing to estimate variance.
</div>
</div>
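<p class="detail">The minimax step reduces to multiplicative weights on per-cluster excess loss; a sketch with illustrative hyperparameters:</p>

```python
import math

def doremi_step(weights, excess_losses, lr=1.0, smoothing=1e-3):
    """One DoReMi-style Group DRO update on cluster weights (sketch).

    Clusters with the largest excess loss (current minus reference model,
    clamped at 0) are exponentially upweighted, then weights are
    renormalized and mixed with uniform so none collapses to zero.
    """
    raw = [w * math.exp(lr * max(l, 0.0)) for w, l in zip(weights, excess_losses)]
    z = sum(raw)
    k = len(weights)
    return [(1 - smoothing) * r / z + smoothing / k for r in raw]

w = [0.2] * 5                        # 5 clusters, uniform start
excess = [0.0, 0.1, 0.8, 0.05, 0.0]  # cluster 2 lags its reference the most
w = doremi_step(w, excess)
print(max(range(5), key=lambda i: w[i]))  # 2 -- weakest cluster upweighted
```

<p class="detail">Note there is no variance estimate to converge here, which is what makes this attractive for 5-bucket <code>cluster_level</code> in short runs.</p>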
<div class="paper">
<div class="paper-title">VADE: Variance-Aware Dynamic Sampling (Beta + Thompson)</div>
<div class="paper-link"><a href="https://arxiv.org/abs/2511.18902">arxiv.org/abs/2511.18902</a></div>
<div class="paper-tags">
<span class="tag difficulty">difficulty</span>
<span class="tag content">content-aware</span>
</div>
<p>Online per-sample difficulty via Beta distributions + Thompson sampling + two-scale prior decay for non-stationarity. Plug-and-play for GRPO/GSPO.</p>
<p class="key-point">Most similar to our per-problem Thompson sampling. Their prior decay mechanism handles policy evolution &mdash; something our EMA alpha partially addresses.</p>
</div>
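<p class="detail">A minimal version of the Beta + Thompson loop, with a single decay rate standing in for VADE's two-scale prior decay (class names and constants are illustrative):</p>

```python
import random

class BetaThompsonSelector:
    """Per-problem Beta/Thompson sampling with prior decay (VADE-style sketch).

    Each problem keeps Beta(alpha, beta) counts of correct/incorrect rollouts.
    `decay` shrinks old counts on each update so estimates track the evolving
    policy (a simplification of VADE's two-scale decay).
    """
    def __init__(self, n_problems, decay=0.95):
        self.alpha = [1.0] * n_problems
        self.beta = [1.0] * n_problems
        self.decay = decay

    def update(self, i, n_correct, n_wrong):
        self.alpha[i] = self.decay * self.alpha[i] + n_correct
        self.beta[i] = self.decay * self.beta[i] + n_wrong

    def pick(self):
        # Sample a pass rate per problem; prefer the draw closest to 0.5
        # (the highest expected reward variance).
        draws = [random.betavariate(a, b) for a, b in zip(self.alpha, self.beta)]
        return min(range(len(draws)), key=lambda i: abs(draws[i] - 0.5))

sel = BetaThompsonSelector(n_problems=3)
sel.update(0, n_correct=8, n_wrong=0)  # easy: almost always solved
sel.update(1, n_correct=4, n_wrong=4)  # frontier
sel.update(2, n_correct=0, n_wrong=8)  # too hard right now
print(sel.pick())  # usually 1
```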
<!-- ───────────────────────── MAPPING TABLE ───────────────────────── -->
<h2>Mapping to Our Experiments</h2>
<table>
<thead>
<tr><th>Our Approach</th><th>Closest Paper</th><th>Key Difference</th></tr>
</thead>
<tbody>
<tr>
<td><code>DifficultyTracker</code> (EMA + UCB)</td>
<td>VCRL, DOTS</td>
<td>DOTS estimates difficulty <em>without</em> rolling out every prompt</td>
</tr>
<tr>
<td>Rejection sampling (zero-var filter)</td>
<td>DEPO</td>
<td>DEPO filters <em>before</em> rollout, saving vLLM compute</td>
</tr>
<tr>
<td><code>BufferStore</code> replay</td>
<td>DOTS replay buffer</td>
<td>Very similar; DOTS uses FIFO, we use age + replay count</td>
</tr>
<tr>
<td>Fixed <code>group_size=8</code></td>
<td>CurES, Reinforce-Ada</td>
<td><strong>Variable group sizes per prompt</strong> &mdash; more for hard</td>
</tr>
<tr>
<td><code>ClusterScorer</code> (Thompson)</td>
<td>DoReMi (minimax)</td>
<td>Worst-case optimization vs variance-proportional sampling</td>
</tr>
<tr>
<td><code>type_level</code> wins, <code>level</code> hurts</td>
<td>"Hard Examples" paper</td>
<td>Content structure &gt; difficulty alone for clustering</td>
</tr>
<tr>
<td>k=15 embed clusters &asymp; type_level</td>
<td><em>No direct analog</em></td>
<td>Optimal k matching metadata cluster count appears novel</td>
</tr>
</tbody>
</table>
<!-- ───────────────────────── ACTIONABLE ───────────────────────── -->
<h2>Most Actionable Ideas</h2>
<div class="action-item">
<div class="action-num">1</div>
<div class="action-body">
<strong>Variable group sizes</strong> &mdash; Instead of <code>group_size=8</code> for all prompts, allocate 12&ndash;16 rollouts for frontier problems and 4 for easy ones. Same total vLLM budget, better gradient signal. Orthogonal to cluster work.
<div class="source">CurES, Reinforce-Ada</div>
</div>
</div>
<div class="action-item">
<div class="action-num">2</div>
<div class="action-body">
<strong>Pre-rollout filtering</strong> &mdash; Train a cheap classifier on (problem_embedding, step) &rarr; predicted reward variance. Skip problems predicted all-correct or all-wrong <em>before</em> sending to vLLM. Reduces wasted <code>max_gen_rounds</code>.
<div class="source">DEPO</div>
</div>
</div>
<div class="action-item">
<div class="action-num">3</div>
<div class="action-body">
<strong>Minimax cluster reweighting</strong> &mdash; Instead of Thompson on cluster variance, upweight clusters with the largest gap between current and baseline performance. May fix <code>cluster_level</code> underperformance (5 buckets too few for Thompson, but minimax targets weakest directly).
<div class="source">DoReMi</div>
</div>
</div>
<div class="action-item">
<div class="action-num">4</div>
<div class="action-body">
<strong>Within-group distinctiveness masking</strong> &mdash; When all 8 outputs are near-identical (common at high accuracy), mask redundant pairs to prevent gradient conflicts. Cheap: just compare output similarity.
<div class="source">DaGRPO</div>
</div>
</div>
<div class="action-item">
<div class="action-num">5</div>
<div class="action-body">
<strong>Advantage reweighting by difficulty</strong> &mdash; Upweight advantages on harder problems in the loss, complementing proposal-side changes. Can combine with length regularization we already have.
<div class="source">GRPO-LEAD</div>
</div>
</div>
<p style="margin-top: 2rem; color: var(--text-muted); font-size: 0.8rem; text-align: center;">
Generated Feb 2026 &mdash; lora-without-regret project
</p>
</body>
</html>