<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Literature Review: Curriculum Learning & Proposal Distributions for GRPO</title>
<style>
:root {
--bg: #0d1117;
--surface: #161b22;
--border: #30363d;
--text: #e6edf3;
--text-muted: #8b949e;
--accent: #58a6ff;
--accent2: #3fb950;
--accent3: #d2a8ff;
--accent4: #f0883e;
--accent5: #f85149;
}
* { margin: 0; padding: 0; box-sizing: border-box; }
body {
font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Helvetica, Arial, sans-serif;
background: var(--bg);
color: var(--text);
line-height: 1.6;
padding: 2rem;
max-width: 1100px;
margin: 0 auto;
}
h1 {
font-size: 1.8rem;
margin-bottom: 0.5rem;
color: var(--accent);
}
.subtitle {
color: var(--text-muted);
margin-bottom: 2rem;
font-size: 0.95rem;
}
h2 {
font-size: 1.3rem;
margin: 2rem 0 1rem;
padding-bottom: 0.4rem;
border-bottom: 1px solid var(--border);
color: var(--accent3);
}
h3 {
font-size: 1.05rem;
margin: 1.2rem 0 0.4rem;
color: var(--accent2);
}
p, li { color: var(--text); font-size: 0.92rem; }
p { margin-bottom: 0.6rem; }
a { color: var(--accent); text-decoration: none; }
a:hover { text-decoration: underline; }
.callout {
background: var(--surface);
border-left: 3px solid var(--accent);
padding: 0.8rem 1rem;
margin: 1rem 0;
border-radius: 0 6px 6px 0;
}
.callout.warn { border-left-color: var(--accent4); }
.callout.good { border-left-color: var(--accent2); }
.callout strong { color: var(--accent); }
.callout.warn strong { color: var(--accent4); }
.callout.good strong { color: var(--accent2); }
.paper {
background: var(--surface);
border: 1px solid var(--border);
border-radius: 8px;
padding: 1rem 1.2rem;
margin: 0.8rem 0;
}
.paper-title {
font-weight: 600;
color: var(--accent2);
font-size: 1rem;
}
.paper-link {
font-size: 0.82rem;
color: var(--text-muted);
margin-bottom: 0.5rem;
}
.paper-tags {
display: flex;
gap: 0.4rem;
flex-wrap: wrap;
margin: 0.5rem 0;
}
.tag {
font-size: 0.72rem;
padding: 2px 8px;
border-radius: 12px;
font-weight: 500;
}
.tag.difficulty { background: #1f3a2a; color: #3fb950; }
.tag.gradient { background: #2a1f3a; color: #d2a8ff; }
.tag.content { background: #3a2a1f; color: #f0883e; }
.tag.venue { background: #1f2a3a; color: #58a6ff; }
.tag.result { background: #3a1f1f; color: #f85149; }
.detail { color: var(--text-muted); font-size: 0.88rem; }
.key-point { margin: 0.3rem 0 0.3rem 1rem; font-size: 0.88rem; }
.key-point::before { content: "→ "; color: var(--accent4); font-weight: bold; }
table {
width: 100%;
border-collapse: collapse;
margin: 1rem 0;
font-size: 0.85rem;
}
th, td {
padding: 0.5rem 0.7rem;
text-align: left;
border: 1px solid var(--border);
}
th { background: var(--surface); color: var(--accent3); font-weight: 600; }
td { color: var(--text); }
tr:nth-child(even) td { background: rgba(22, 27, 34, 0.5); }
.action-item {
background: var(--surface);
border: 1px solid var(--border);
border-radius: 8px;
padding: 0.8rem 1rem;
margin: 0.6rem 0;
display: flex;
gap: 0.8rem;
align-items: flex-start;
}
.action-num {
background: var(--accent);
color: var(--bg);
font-weight: 700;
width: 28px;
height: 28px;
border-radius: 50%;
display: flex;
align-items: center;
justify-content: center;
flex-shrink: 0;
font-size: 0.85rem;
}
.action-body { flex: 1; }
.action-body strong { color: var(--accent); }
.action-body .source { font-size: 0.8rem; color: var(--text-muted); }
</style>
</head>
<body>
<h1>Curriculum Learning & Proposal Distributions for GRPO</h1>
<p class="subtitle">Literature review — Feb 2026. Connections to our cluster/priority sampling experiments noted throughout.</p>
<div class="callout">
<strong>Core problem:</strong> Under uniform sampling, GRPO wastes most compute. Easy problems → all-correct groups (zero gradient). Hard problems → all-wrong groups (zero gradient). As training progresses, informative prompts shrink to ~10% of the pool.
</div>
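<p class="detail">The zero-gradient failure mode falls directly out of the group-normalized advantage. A minimal sketch (plain Python; the function name and epsilon are our illustration, not from any specific GRPO implementation):</p>

```python
def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: per-group mean-centering and std-normalization.

    When every reward in the group is identical (all-correct or all-wrong),
    every advantage is zero and the prompt contributes no policy gradient.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    return [(r - mean) / (var ** 0.5 + eps) for r in rewards]

print(group_advantages([1, 1, 1, 1]))  # [0.0, 0.0, 0.0, 0.0] — all-correct, no signal
print(group_advantages([0, 0, 0, 0]))  # [0.0, 0.0, 0.0, 0.0] — all-wrong, no signal
print(group_advantages([1, 0, 1, 0]))  # mixed outcomes: nonzero gradient signal
```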
<!-- ───────────────────────── SECTION 2 ───────────────────────── -->
<h2>Difficulty-Only Approaches</h2>
<p class="detail">Closest to our <code>DifficultyTracker</code> (EMA reward variance + UCB scoring).</p>
<div class="paper">
<div class="paper-title">VCRL: Variance-based Curriculum Reinforcement Learning</div>
<div class="paper-link"><a href="https://arxiv.org/abs/2509.19803">arxiv.org/abs/2509.19803</a></div>
<div class="paper-tags">
<span class="tag difficulty">difficulty</span>
<span class="tag venue">2025</span>
</div>
<p>Uses reward variance directly as curriculum signal. Moderate-variance samples are optimal. Dynamically shifts sampling toward the "frontier."</p>
<p class="key-point">Validates that reward variance (what our <code>DifficultyTracker</code> uses) is the right signal.</p>
<p class="key-point">Beats GRPO/DAPO/GSPO on 5 math benchmarks.</p>
</div>
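<p class="detail">For binary rewards, the variance VCRL scores on reduces to p(1 − p) of the per-prompt pass rate, which vanishes at both extremes and peaks at the 0.5 frontier. A quick illustration (ours, not from the paper):</p>

```python
def reward_variance(pass_rate):
    """Variance of a 0/1 reward with success probability pass_rate."""
    return pass_rate * (1.0 - pass_rate)

# Variance vanishes for saturated prompts and peaks at the 0.5 frontier.
for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(f"pass_rate={p:.1f}  variance={reward_variance(p):.2f}")
```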
<div class="paper">
<div class="paper-title">DOTS: Difficulty-targeted Online Data Selection & Rollout Replay</div>
<div class="paper-link"><a href="https://arxiv.org/abs/2506.05316">arxiv.org/abs/2506.05316</a></div>
<div class="paper-tags">
<span class="tag difficulty">difficulty</span>
<span class="tag venue">NeurIPS 2025</span>
<span class="tag result">23-62% speedup</span>
</div>
<p>Targets ~0.5 pass rate (information-theoretic sweet spot). Estimates difficulty via attention-based similarity on a small reference set — avoids rolling out every candidate. Bounded FIFO replay buffer.</p>
<p class="key-point">Their replay buffer is very similar to our <code>BufferStore</code> (age-based eviction). Their difficulty estimation without rollouts could save vLLM compute vs our roll-out-then-reject approach.</p>
</div>
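<p class="detail">A bounded replay buffer of this kind fits in a few lines. The sketch below is a hypothetical hybrid: FIFO capacity eviction as in DOTS, plus the age and replay-count limits our <code>BufferStore</code> uses; the interface is invented for illustration.</p>

```python
from collections import deque

class ReplayBuffer:
    """Bounded rollout replay: FIFO capacity eviction plus staleness/overuse limits."""

    def __init__(self, capacity, max_age, max_replays):
        self.buf = deque(maxlen=capacity)  # oldest entry evicted when full
        self.max_age = max_age
        self.max_replays = max_replays

    def add(self, rollout, step):
        self.buf.append({"rollout": rollout, "born": step, "replays": 0})

    def sample(self, step):
        """Return replayable rollouts, dropping stale or overused entries."""
        keep, out = deque(maxlen=self.buf.maxlen), []
        for e in self.buf:
            if step - e["born"] > self.max_age or e["replays"] >= self.max_replays:
                continue  # evict: too old, or replayed too many times
            e["replays"] += 1
            out.append(e["rollout"])
            keep.append(e)
        self.buf = keep
        return out

buf = ReplayBuffer(capacity=4, max_age=2, max_replays=2)
buf.add("rollout-a", step=0)
print(buf.sample(step=1))  # ['rollout-a']
print(buf.sample(step=5))  # [] — the entry is now too old
```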
<div class="paper">
<div class="paper-title">Hard Examples Are All You Need</div>
<div class="paper-link"><a href="https://arxiv.org/abs/2508.14094">arxiv.org/abs/2508.14094</a></div>
<div class="paper-tags">
<span class="tag difficulty">difficulty</span>
<span class="tag result">up to +47%</span>
</div>
<p>Contrarian take: training on the <strong>hardest 10%</strong> (by base model pass rate) yields up to 47% gains. Hard examples maintain mixed outcomes throughout training; easy ones plateau.</p>
<p class="key-point">Interesting tension with our finding that <code>cluster_level</code> (difficulty-only clustering with 5 buckets) consistently hurt. Possibly too coarse, or EMA difficulty ≠ base-model difficulty.</p>
</div>
<div class="paper">
<div class="paper-title">DEPO: Difficulty-Estimated Policy Optimization</div>
<div class="paper-link"><a href="https://arxiv.org/abs/2602.06375">arxiv.org/abs/2602.06375</a></div>
<div class="paper-tags">
<span class="tag difficulty">difficulty</span>
<span class="tag result">2x cost reduction</span>
</div>
<p>Lightweight online difficulty estimator filters <strong>before rollout</strong>. Avoids wasting vLLM compute on zero-variance groups.</p>
<p class="key-point">Our rejection sampling filters <em>after</em> generation. DEPO's pre-rollout filtering could reduce wasted <code>max_gen_rounds</code>.</p>
</div>
<!-- ───────────────────────── SECTION 3 ───────────────────────── -->
<h2>Gradient-Aware Approaches</h2>
<p class="detail">Goes deeper than reward variance — analyzes the actual gradient signal.</p>
<div class="paper">
<div class="paper-title">CurES: From Gradient Analysis to Efficient Curriculum Learning</div>
<div class="paper-link"><a href="https://arxiv.org/abs/2510.01037">arxiv.org/abs/2510.01037</a></div>
<div class="paper-tags">
<span class="tag gradient">gradient-aware</span>
<span class="tag result">+3.3pt (1.5B), +4.8pt (7B)</span>
</div>
<p>Most theoretically grounded. Gradient analysis reveals <strong>two independent levers</strong>:</p>
<ol style="margin: 0.3rem 0 0.3rem 1.5rem; font-size: 0.88rem;">
<li><strong>Prompt sampling distribution</strong> dictates convergence rate (which problems)</li>
<li><strong>Rollout quantity allocation</strong> per prompt affects gradient stability (how many samples)</li>
</ol>
<p>Uses Bayesian posterior estimation to set both.</p>
<div class="callout warn">
<strong>Key insight for us:</strong> We use fixed <code>group_size=8</code> for all prompts. CurES says allocate more rollouts to frontier problems, fewer to easy/impossible ones. This is orthogonal to cluster work.
</div>
</div>
<div class="paper">
<div class="paper-title">Reinforce-Ada: Adaptive Sampling Framework</div>
<div class="paper-link"><a href="https://arxiv.org/abs/2510.04996">arxiv.org/abs/2510.04996</a></div>
<div class="paper-tags">
<span class="tag gradient">gradient-aware</span>
<span class="tag result">2x convergence speedup</span>
</div>
<p>Optimal per-prompt group size is proportional to difficulty. Signal loss at high accuracy is from <strong>undersampling</strong>, not fundamental. Larger groups for hard prompts recover learning signals.</p>
<p class="key-point">Same total inference budget, 2x convergence. Variable group sizes could be a big win.</p>
</div>
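<p class="detail">The shared idea of CurES and Reinforce-Ada, reallocating a fixed rollout budget by difficulty, can be sketched with a simple proportional-to-variance rule. The allocation below is our own stand-in: CurES uses Bayesian posterior estimates and Reinforce-Ada an adaptive sampling loop, neither of which is reproduced here.</p>

```python
def allocate_group_sizes(pass_rates, total_budget, min_g=2, max_g=16):
    """Split a fixed rollout budget across prompts, favoring high-variance ones.

    Proportional-to-variance weighting is an illustrative rule, not either
    paper's exact scheme. Sizes are clamped to [min_g, max_g], then adjusted
    greedily so the total budget is met exactly.
    """
    weights = [max(p * (1 - p), 1e-3) for p in pass_rates]  # binary-reward variance
    total_w = sum(weights)
    sizes = [min(max_g, max(min_g, round(total_budget * w / total_w)))
             for w in weights]
    while sum(sizes) > total_budget:  # trim the largest group
        i = max((i for i in range(len(sizes)) if sizes[i] > min_g),
                key=lambda i: sizes[i])
        sizes[i] -= 1
    while sum(sizes) < total_budget:  # grow the most under-served group
        i = max((i for i in range(len(sizes)) if sizes[i] < max_g),
                key=lambda i: weights[i] / sizes[i])
        sizes[i] += 1
    return sizes

# A frontier prompt (0.5) gets the lion's share; saturated prompts get the floor.
print(allocate_group_sizes([0.0, 0.5, 1.0], total_budget=24))
```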
<div class="paper">
<div class="paper-title">DaGRPO: Distinctiveness-Aware GRPO</div>
<div class="paper-link"><a href="https://arxiv.org/abs/2512.06337">arxiv.org/abs/2512.06337</a></div>
<div class="paper-tags">
<span class="tag gradient">gradient-aware</span>
<span class="tag result">+4.7% math</span>
</div>
<p>Even within a group, homogeneous outputs create <strong>gradient conflicts</strong>. Masks sample pairs with low "distinctiveness" and augments hard problems with off-policy positives.</p>
<p class="key-point">At 84% accuracy (our 8B run), most groups are all-correct with near-identical outputs. DaGRPO masking could help.</p>
</div>
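<p class="detail">The masking step is cheap to prototype: compare rollouts pairwise and drop near-duplicates from the loss. Sketch under our own assumptions (token-set Jaccard as a stand-in similarity; DaGRPO defines its own distinctiveness score):</p>

```python
def distinctiveness_mask(outputs, threshold=0.9):
    """Keep one representative per cluster of near-duplicate rollouts.

    Jaccard over whitespace tokens is a deliberately crude similarity proxy.
    Returns a boolean keep-mask aligned with outputs.
    """
    def jaccard(a, b):
        sa, sb = set(a.split()), set(b.split())
        return len(sa & sb) / max(len(sa | sb), 1)

    keep = [True] * len(outputs)
    for i in range(len(outputs)):
        if not keep[i]:
            continue
        for j in range(i + 1, len(outputs)):
            if keep[j] and jaccard(outputs[i], outputs[j]) >= threshold:
                keep[j] = False  # mask the redundant near-duplicate
    return keep

outs = ["the answer is 42", "the answer is 42", "I think the answer is 41"]
print(distinctiveness_mask(outs))  # [True, False, True]
```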
<!-- ───────────────────────── SECTION 4 ───────────────────────── -->
<h2>Content-Aware & Domain-Weighting Approaches</h2>
<p class="detail">Closest to our <code>ClusterScorer</code> / Thompson sampling on clusters.</p>
<div class="paper">
<div class="paper-title">GRPO-LEAD: Difficulty-Aware RL + Length Regularization</div>
<div class="paper-link"><a href="https://arxiv.org/abs/2504.09696">arxiv.org/abs/2504.09696</a></div>
<div class="paper-tags">
<span class="tag content">content-aware</span>
<span class="tag venue">EMNLP 2025</span>
</div>
<p>Upweights advantages on harder problems in the <strong>loss computation</strong> (not sampling stage). Length-regularized rewards for conciseness.</p>
<p class="key-point">Our reward shaping (think_close_penalty, length_penalty) is similar to their length regularization. Their advantage reweighting is complementary — modify the loss, not the proposal distribution.</p>
</div>
<div class="paper">
<div class="paper-title">DoReMi: Optimizing Data Mixtures via Minimax</div>
<div class="paper-link"><a href="https://arxiv.org/abs/2305.10429">arxiv.org/abs/2305.10429</a></div>
<div class="paper-tags">
<span class="tag content">domain-weighting</span>
<span class="tag venue">NeurIPS 2023</span>
</div>
<p>Pretraining analog of our cluster Thompson sampling. Group DRO across domains: <strong>minimax over worst-case excess loss</strong>. Upweights domains where the model has the largest gap between current and reference model.</p>
<div class="callout warn">
<strong>Key insight for us:</strong> Might fix <code>cluster_level</code> underperformance. Thompson sampling on 5 buckets can't converge in 20 steps, but minimax would directly target the weakest cluster without needing to estimate variance.
</div>
</div>
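<p class="detail">Adapted to clusters, DoReMi's update is an exponentiated-gradient step: multiplicatively upweight whichever cluster shows the largest excess loss over a reference, then renormalize. A hedged sketch (our adaptation to the cluster setting, not DoReMi's exact pretraining recipe):</p>

```python
import math

def doremi_step(weights, excess_losses, lr=0.5):
    """One exponentiated-gradient step on cluster sampling weights.

    excess_losses[i] = current_loss[i] - reference_loss[i]; clusters where the
    model lags its reference the most are exponentially upweighted. Negative
    excess (better than reference) is clipped to zero, as in Group DRO.
    """
    scaled = [w * math.exp(lr * max(e, 0.0))
              for w, e in zip(weights, excess_losses)]
    z = sum(scaled)
    return [s / z for s in scaled]

w = [1 / 3, 1 / 3, 1 / 3]
for _ in range(5):
    w = doremi_step(w, excess_losses=[0.0, 0.1, 0.8])
print(w)  # weight mass migrates toward cluster 2, the worst-case cluster
```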
<div class="paper">
<div class="paper-title">VADE: Variance-Aware Dynamic Sampling (Beta + Thompson)</div>
<div class="paper-link"><a href="https://arxiv.org/abs/2511.18902">arxiv.org/abs/2511.18902</a></div>
<div class="paper-tags">
<span class="tag difficulty">difficulty</span>
<span class="tag content">content-aware</span>
</div>
<p>Online per-sample difficulty via Beta distributions + Thompson sampling + two-scale prior decay for non-stationarity. Plug-and-play for GRPO/GSPO.</p>
<p class="key-point">Most similar to our per-problem Thompson sampling. Their prior decay mechanism handles policy evolution — something our EMA alpha partially addresses.</p>
</div>
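<p class="detail">Per-problem Beta posteriors with Thompson sampling and decay fit in a few lines. The single decay factor below is a simplification of VADE's two-scale prior decay, and the class interface is ours:</p>

```python
import random

class BetaThompson:
    """Beta-posterior pass-rate tracker with decay toward the uniform prior.

    Decay discounts old evidence so the posterior can follow a moving policy;
    VADE uses a two-scale decay, this single-factor version is a simplification.
    """

    def __init__(self, n_problems, decay=0.99):
        self.a = [1.0] * n_problems  # prior + decayed successes
        self.b = [1.0] * n_problems  # prior + decayed failures
        self.decay = decay

    def update(self, i, successes, failures):
        self.a[i] = 1.0 + self.decay * (self.a[i] - 1.0) + successes
        self.b[i] = 1.0 + self.decay * (self.b[i] - 1.0) + failures

    def pick(self, rng):
        """Thompson step: sample a pass rate per problem, pick the one nearest 0.5."""
        draws = [rng.betavariate(a, b) for a, b in zip(self.a, self.b)]
        return min(range(len(draws)), key=lambda i: abs(draws[i] - 0.5))

rng = random.Random(0)
ts = BetaThompson(3)
ts.update(0, successes=8, failures=0)  # saturated: nearly always solved
ts.update(1, successes=4, failures=4)  # frontier
ts.update(2, successes=0, failures=8)  # currently impossible
picks = [ts.pick(rng) for _ in range(100)]
# the frontier problem (index 1) is picked most often
```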
<!-- ───────────────────────── MAPPING TABLE ───────────────────────── -->
<h2>Mapping to Our Experiments</h2>
<table>
<thead>
<tr><th>Our Approach</th><th>Closest Paper</th><th>Key Difference</th></tr>
</thead>
<tbody>
<tr>
<td><code>DifficultyTracker</code> (EMA + UCB)</td>
<td>VCRL, DOTS</td>
<td>DOTS estimates difficulty <em>without</em> rolling out every prompt</td>
</tr>
<tr>
<td>Rejection sampling (zero-var filter)</td>
<td>DEPO</td>
<td>DEPO filters <em>before</em> rollout, saving vLLM compute</td>
</tr>
<tr>
<td><code>BufferStore</code> replay</td>
<td>DOTS replay buffer</td>
<td>Very similar; DOTS uses FIFO, we use age + replay count</td>
</tr>
<tr>
<td>Fixed <code>group_size=8</code></td>
<td>CurES, Reinforce-Ada</td>
<td><strong>Variable group sizes per prompt</strong> — more for hard</td>
</tr>
<tr>
<td><code>ClusterScorer</code> (Thompson)</td>
<td>DoReMi (minimax)</td>
<td>Worst-case optimization vs variance-proportional sampling</td>
</tr>
<tr>
<td><code>type_level</code> wins, <code>level</code> hurts</td>
<td>"Hard Examples" paper</td>
<td>Content structure > difficulty alone for clustering</td>
</tr>
<tr>
<td>k=15 embed clusters ≈ type_level</td>
<td><em>No direct analog</em></td>
<td>Optimal k matching metadata cluster count appears novel</td>
</tr>
</tbody>
</table>
<!-- ───────────────────────── ACTIONABLE ───────────────────────── -->
<h2>Most Actionable Ideas</h2>
<div class="action-item">
<div class="action-num">1</div>
<div class="action-body">
<strong>Variable group sizes</strong> — Instead of <code>group_size=8</code> for all prompts, allocate 12–16 rollouts for frontier problems and 4 for easy ones. Same total vLLM budget, better gradient signal. Orthogonal to cluster work.
<div class="source">CurES, Reinforce-Ada</div>
</div>
</div>
<div class="action-item">
<div class="action-num">2</div>
<div class="action-body">
<strong>Pre-rollout filtering</strong> — Train a cheap classifier on (problem_embedding, step) → predicted reward variance. Skip problems predicted all-correct or all-wrong <em>before</em> sending to vLLM. Reduces wasted <code>max_gen_rounds</code>.
<div class="source">DEPO</div>
</div>
</div>
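<p class="detail">The gate itself is trivial once a predicted pass rate exists; the hard part is the (hypothetical) learned model over <code>(problem_embedding, step)</code> that produces it. Sketch of the gating logic only:</p>

```python
def should_rollout(predicted_pass_rate, min_variance=0.05):
    """Gate vLLM generation on predicted reward variance.

    predicted_pass_rate would come from a cheap learned model over
    (problem_embedding, training_step); here it is simply an input. Prompts
    predicted to be all-correct or all-wrong are skipped before rollout.
    """
    predicted_variance = predicted_pass_rate * (1.0 - predicted_pass_rate)
    return predicted_variance >= min_variance

# Frontier prompts pass the gate; saturated ones are skipped pre-rollout.
print([should_rollout(p) for p in (0.01, 0.5, 0.99)])  # [False, True, False]
```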
<div class="action-item">
<div class="action-num">3</div>
<div class="action-body">
<strong>Minimax cluster reweighting</strong> — Instead of Thompson on cluster variance, upweight clusters with the largest gap between current and baseline performance. May fix <code>cluster_level</code> underperformance (5 buckets too few for Thompson, but minimax targets weakest directly).
<div class="source">DoReMi</div>
</div>
</div>
<div class="action-item">
<div class="action-num">4</div>
<div class="action-body">
<strong>Within-group distinctiveness masking</strong> — When all 8 outputs are near-identical (common at high accuracy), mask redundant pairs to prevent gradient conflicts. Cheap: just compare output similarity.
<div class="source">DaGRPO</div>
</div>
</div>
<div class="action-item">
<div class="action-num">5</div>
<div class="action-body">
<strong>Advantage reweighting by difficulty</strong> — Upweight advantages on harder problems in the loss, complementing proposal-side changes. Can combine with length regularization we already have.
<div class="source">GRPO-LEAD</div>
</div>
</div>
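<p class="detail">Loss-side reweighting composes with any of the sampling changes above: scale each group's advantages by a difficulty weight before the policy-gradient step. The weighting curve below is our illustrative choice, not GRPO-LEAD's exact scheme:</p>

```python
def reweight_advantages(advantages, pass_rate, gamma=1.0):
    """Scale a group's advantages by how hard its prompt currently is.

    weight = (1 - pass_rate) ** gamma is an illustrative curve: harder prompts
    (low pass rate) contribute more to the loss, easier ones less.
    """
    weight = (1.0 - pass_rate) ** gamma
    return [weight * a for a in advantages]

# Same raw advantages, scaled by the difficulty of the prompt they came from.
print(reweight_advantages([1.0, -1.0], pass_rate=0.5))  # [0.5, -0.5]
print(reweight_advantages([1.0, -1.0], pass_rate=0.0))  # [1.0, -1.0] (hardest)
```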
<p style="margin-top: 2rem; color: var(--text-muted); font-size: 0.8rem; text-align: center;">
Generated Feb 2026 — lora-without-regret project
</p>
</body>
</html>