@aria42
Created February 25, 2026 23:32
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Literature Review: Curriculum Learning &amp; Proposal Distributions for GRPO</title>
<style>
:root {
--bg: #0d1117;
--surface: #161b22;
--border: #30363d;
--text: #e6edf3;
--text-muted: #8b949e;
--accent: #58a6ff;
--accent2: #3fb950;
--accent3: #d2a8ff;
--accent4: #f0883e;
--accent5: #f85149;
}
* { margin: 0; padding: 0; box-sizing: border-box; }
body {
font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Helvetica, Arial, sans-serif;
background: var(--bg);
color: var(--text);
line-height: 1.6;
padding: 2rem;
max-width: 1100px;
margin: 0 auto;
}
h1 {
font-size: 1.8rem;
margin-bottom: 0.5rem;
color: var(--accent);
}
.subtitle {
color: var(--text-muted);
margin-bottom: 2rem;
font-size: 0.95rem;
}
h2 {
font-size: 1.3rem;
margin: 2rem 0 1rem;
padding-bottom: 0.4rem;
border-bottom: 1px solid var(--border);
color: var(--accent3);
}
h3 {
font-size: 1.05rem;
margin: 1.2rem 0 0.4rem;
color: var(--accent2);
}
p, li { color: var(--text); font-size: 0.92rem; }
p { margin-bottom: 0.6rem; }
a { color: var(--accent); text-decoration: none; }
a:hover { text-decoration: underline; }
.callout {
background: var(--surface);
border-left: 3px solid var(--accent);
padding: 0.8rem 1rem;
margin: 1rem 0;
border-radius: 0 6px 6px 0;
}
.callout.warn { border-left-color: var(--accent4); }
.callout.good { border-left-color: var(--accent2); }
.callout strong { color: var(--accent); }
.callout.warn strong { color: var(--accent4); }
.callout.good strong { color: var(--accent2); }
.paper {
background: var(--surface);
border: 1px solid var(--border);
border-radius: 8px;
padding: 1rem 1.2rem;
margin: 0.8rem 0;
}
.paper-title {
font-weight: 600;
color: var(--accent2);
font-size: 1rem;
}
.paper-link {
font-size: 0.82rem;
color: var(--text-muted);
margin-bottom: 0.5rem;
}
.paper-tags {
display: flex;
gap: 0.4rem;
flex-wrap: wrap;
margin: 0.5rem 0;
}
.tag {
font-size: 0.72rem;
padding: 2px 8px;
border-radius: 12px;
font-weight: 500;
}
.tag.difficulty { background: #1f3a2a; color: #3fb950; }
.tag.gradient { background: #2a1f3a; color: #d2a8ff; }
.tag.content { background: #3a2a1f; color: #f0883e; }
.tag.venue { background: #1f2a3a; color: #58a6ff; }
.tag.result { background: #3a1f1f; color: #f85149; }
.detail { color: var(--text-muted); font-size: 0.88rem; }
.key-point { margin: 0.3rem 0 0.3rem 1rem; font-size: 0.88rem; }
.key-point::before { content: "→ "; color: var(--accent4); font-weight: bold; }
table {
width: 100%;
border-collapse: collapse;
margin: 1rem 0;
font-size: 0.85rem;
}
th, td {
padding: 0.5rem 0.7rem;
text-align: left;
border: 1px solid var(--border);
}
th { background: var(--surface); color: var(--accent3); font-weight: 600; }
td { color: var(--text); }
tr:nth-child(even) td { background: rgba(22, 27, 34, 0.5); }
.action-item {
background: var(--surface);
border: 1px solid var(--border);
border-radius: 8px;
padding: 0.8rem 1rem;
margin: 0.6rem 0;
display: flex;
gap: 0.8rem;
align-items: flex-start;
}
.action-num {
background: var(--accent);
color: var(--bg);
font-weight: 700;
width: 28px;
height: 28px;
border-radius: 50%;
display: flex;
align-items: center;
justify-content: center;
flex-shrink: 0;
font-size: 0.85rem;
}
.action-body { flex: 1; }
.action-body strong { color: var(--accent); }
.action-body .source { font-size: 0.8rem; color: var(--text-muted); }
</style>
</head>
<body>
<h1>Curriculum Learning &amp; Proposal Distributions for GRPO</h1>
<p class="subtitle">Literature review &mdash; Feb 2026. Connections to our cluster/priority sampling experiments noted throughout.</p>
<div class="callout">
<strong>Core problem:</strong> Under uniform sampling, GRPO wastes most compute. Easy problems &rarr; all-correct groups (zero gradient). Hard problems &rarr; all-wrong groups (zero gradient). As training progresses, informative prompts shrink to ~10% of the pool.
</div>
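<p class="detail">As a concrete illustration (not taken from any of the papers below), a minimal sketch of GRPO's group-normalized advantage shows why uniform-outcome groups contribute nothing to the gradient:</p>

```python
def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: normalize rewards within one rollout group."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    # If every rollout got the same reward (all-correct or all-wrong),
    # every advantage is 0 and the prompt contributes no policy gradient.
    return [(r - mean) / (std + eps) for r in rewards]

print(group_advantages([1, 1, 1, 1]))  # all-correct -> all zeros
print(group_advantages([1, 0, 1, 0]))  # mixed outcomes -> nonzero signal
```

<p class="detail">Every curriculum below is, one way or another, a strategy for keeping sampled groups in the mixed-outcome regime.</p>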
<!-- ───────────────────────── SECTION 2 ───────────────────────── -->
<h2>Difficulty-Only Approaches</h2>
<p class="detail">Closest to our <code>DifficultyTracker</code> (EMA reward variance + UCB scoring).</p>
<div class="paper">
<div class="paper-title">VCRL: Variance-based Curriculum Reinforcement Learning</div>
<div class="paper-link"><a href="https://arxiv.org/abs/2509.19803">arxiv.org/abs/2509.19803</a></div>
<div class="paper-tags">
<span class="tag difficulty">difficulty</span>
<span class="tag venue">2025</span>
</div>
<p>Uses reward variance directly as the curriculum signal: high-variance samples (those of moderate difficulty) are most informative. Dynamically shifts sampling toward the "frontier."</p>
<p class="key-point">Validates that reward variance (what our <code>DifficultyTracker</code> uses) is the right signal.</p>
<p class="key-point">Beats GRPO/DAPO/GSPO on 5 math benchmarks.</p>
</div>
<div class="paper">
<div class="paper-title">DOTS: Difficulty-targeted Online Data Selection &amp; Rollout Replay</div>
<div class="paper-link"><a href="https://arxiv.org/abs/2506.05316">arxiv.org/abs/2506.05316</a></div>
<div class="paper-tags">
<span class="tag difficulty">difficulty</span>
<span class="tag venue">NeurIPS 2025</span>
<span class="tag result">23-62% speedup</span>
</div>
<p>Targets ~0.5 pass rate (information-theoretic sweet spot). Estimates difficulty via attention-based similarity on a small reference set &mdash; avoids rolling out every candidate. Bounded FIFO replay buffer.</p>
<p class="key-point">Their replay buffer is very similar to our <code>BufferStore</code> (age-based eviction). Their difficulty estimation without rollouts could save vLLM compute vs our roll-out-then-reject approach.</p>
</div>
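<p class="detail">Their bounded FIFO buffer can be sketched with a <code>deque</code>; the class and field names here are illustrative, not DOTS's implementation:</p>

```python
import random
from collections import deque

class RolloutReplayBuffer:
    """Bounded FIFO replay buffer: oldest rollouts are evicted first (sketch)."""
    def __init__(self, capacity):
        self._buf = deque(maxlen=capacity)  # deque drops the oldest item on overflow

    def add(self, rollout):
        self._buf.append(rollout)

    def sample(self, k):
        # Sample without replacement from whatever is currently retained.
        return random.sample(list(self._buf), min(k, len(self._buf)))

buf = RolloutReplayBuffer(capacity=3)
for step in range(5):
    buf.add({"prompt_id": step})
print([r["prompt_id"] for r in buf._buf])  # [2, 3, 4] -- oldest two evicted
```

<p class="detail">Our <code>BufferStore</code> differs only in the eviction rule (age + replay count instead of pure FIFO).</p>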
<div class="paper">
<div class="paper-title">Hard Examples Are All You Need</div>
<div class="paper-link"><a href="https://arxiv.org/abs/2508.14094">arxiv.org/abs/2508.14094</a></div>
<div class="paper-tags">
<span class="tag difficulty">difficulty</span>
<span class="tag result">up to +47%</span>
</div>
<p>Contrarian take: training on the <strong>hardest 10%</strong> (by base model pass rate) yields up to 47% gains. Hard examples maintain mixed outcomes throughout training; easy ones plateau.</p>
<p class="key-point">Interesting tension with our finding that <code>cluster_level</code> (difficulty-only clustering with 5 buckets) consistently hurt. Possibly too coarse, or EMA difficulty &ne; base-model difficulty.</p>
</div>
<div class="paper">
<div class="paper-title">DEPO: Difficulty-Estimated Policy Optimization</div>
<div class="paper-link"><a href="https://arxiv.org/abs/2602.06375">arxiv.org/abs/2602.06375</a></div>
<div class="paper-tags">
<span class="tag difficulty">difficulty</span>
<span class="tag result">2x cost reduction</span>
</div>
<p>Lightweight online difficulty estimator filters <strong>before rollout</strong>. Avoids wasting vLLM compute on zero-variance groups.</p>
<p class="key-point">Our rejection sampling filters <em>after</em> generation. DEPO's pre-rollout filtering could reduce wasted <code>max_gen_rounds</code>.</p>
</div>
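<p class="detail">A hedged sketch of the idea (not DEPO's actual estimator): with binary rewards and an estimated pass rate, the probability that a group comes back uniform has a closed form, so a prompt can be skipped before any generation happens. The threshold is illustrative.</p>

```python
def should_rollout(p_hat, group_size=8, min_informative_prob=0.3):
    """Pre-rollout filter (DEPO-style sketch; constants are illustrative).

    With binary rewards and estimated pass rate p_hat, the chance that a
    group of `group_size` rollouts is all-correct or all-wrong (zero
    advantage, zero gradient) is p^n + (1-p)^n. Skip prompts whose group
    is unlikely to come back mixed.
    """
    n = group_size
    p_uniform = p_hat ** n + (1.0 - p_hat) ** n
    return (1.0 - p_uniform) >= min_informative_prob

print(should_rollout(0.5))   # True: frontier prompt, group almost surely mixed
print(should_rollout(0.98))  # False: likely all-correct, skip before vLLM
```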
<!-- ───────────────────────── SECTION 3 ───────────────────────── -->
<h2>Gradient-Aware Approaches</h2>
<p class="detail">Goes deeper than reward variance &mdash; analyzes the actual gradient signal.</p>
<div class="paper">
<div class="paper-title">CurES: From Gradient Analysis to Efficient Curriculum Learning</div>
<div class="paper-link"><a href="https://arxiv.org/abs/2510.01037">arxiv.org/abs/2510.01037</a></div>
<div class="paper-tags">
<span class="tag gradient">gradient-aware</span>
<span class="tag result">+3.3pt (1.5B), +4.8pt (7B)</span>
</div>
<p>Most theoretically grounded. Gradient analysis reveals <strong>two independent levers</strong>:</p>
<ol style="margin: 0.3rem 0 0.3rem 1.5rem; font-size: 0.88rem;">
<li><strong>Prompt sampling distribution</strong> dictates convergence rate (which problems)</li>
<li><strong>Rollout quantity allocation</strong> per prompt affects gradient stability (how many samples)</li>
</ol>
<p>Uses Bayesian posterior estimation to set both.</p>
<div class="callout warn">
<strong>Key insight for us:</strong> We use fixed <code>group_size=8</code> for all prompts. CurES says allocate more rollouts to frontier problems, fewer to easy/impossible ones. This is orthogonal to cluster work.
</div>
</div>
<div class="paper">
<div class="paper-title">Reinforce-Ada: Adaptive Sampling Framework</div>
<div class="paper-link"><a href="https://arxiv.org/abs/2510.04996">arxiv.org/abs/2510.04996</a></div>
<div class="paper-tags">
<span class="tag gradient">gradient-aware</span>
<span class="tag result">2x convergence speedup</span>
</div>
<p>Optimal per-prompt group size is proportional to difficulty. Signal loss at high accuracy stems from <strong>undersampling</strong>, not a fundamental limit; larger groups for hard prompts recover the learning signal.</p>
<p class="key-point">Same total inference budget, 2x convergence. Variable group sizes could be a big win.</p>
</div>
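<p class="detail">One simple way to realize variable group sizes under a fixed budget &mdash; an illustrative heuristic, not the papers' Bayesian allocation &mdash; is to split rollouts in proportion to estimated Bernoulli reward variance p(1&minus;p):</p>

```python
def allocate_rollouts(pass_rates, total_budget, min_n=2):
    """Split a fixed rollout budget across prompts by estimated difficulty.

    Illustrative sketch (not the CurES/Reinforce-Ada estimator): weight each
    prompt by its reward variance p*(1-p), so frontier prompts (p ~ 0.5) get
    more rollouts while easy/impossible ones get the minimum.
    """
    weights = [max(p * (1 - p), 1e-6) for p in pass_rates]
    spare = total_budget - min_n * len(pass_rates)
    total_w = sum(weights)
    alloc = [min_n + int(spare * w / total_w) for w in weights]
    # Hand out rounding leftovers to the highest-variance prompts.
    leftover = total_budget - sum(alloc)
    for i in sorted(range(len(weights)), key=lambda i: -weights[i])[:leftover]:
        alloc[i] += 1
    return alloc

# 4 prompts under the same budget as fixed group_size=8 (32 rollouts total):
print(allocate_rollouts([0.05, 0.5, 0.6, 0.95], total_budget=32))
```

<p class="detail">The total vLLM budget is unchanged; only its distribution across prompts moves.</p>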
<div class="paper">
<div class="paper-title">DaGRPO: Distinctiveness-Aware GRPO</div>
<div class="paper-link"><a href="https://arxiv.org/abs/2512.06337">arxiv.org/abs/2512.06337</a></div>
<div class="paper-tags">
<span class="tag gradient">gradient-aware</span>
<span class="tag result">+4.7% math</span>
</div>
<p>Even within a group, homogeneous outputs create <strong>gradient conflicts</strong>. Masks sample pairs with low "distinctiveness" and augments hard problems with off-policy positives.</p>
<p class="key-point">At 84% accuracy (our 8B run), most groups are all-correct with near-identical outputs. DaGRPO masking could help.</p>
</div>
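<p class="detail">A cheap approximation of the masking step (token-level Jaccard similarity stands in for the paper's distinctiveness score; the threshold is illustrative):</p>

```python
def distinctiveness_mask(outputs, threshold=0.9):
    """Mask near-duplicate rollouts within a group (DaGRPO-inspired sketch).

    Returns a keep-mask: the first member of each near-duplicate cluster is
    kept, later copies are masked out of the loss.
    """
    token_sets = [set(o.split()) for o in outputs]
    keep = [True] * len(outputs)
    for i in range(len(outputs)):
        if not keep[i]:
            continue
        for j in range(i + 1, len(outputs)):
            if not keep[j]:
                continue
            inter = len(token_sets[i] & token_sets[j])
            union = len(token_sets[i] | token_sets[j]) or 1
            if inter / union >= threshold:
                keep[j] = False  # redundant pair: drop the later copy
    return keep

outs = ["the answer is 42", "the answer is 42", "it equals 7"]
print(distinctiveness_mask(outs))  # [True, False, True]
```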
<!-- ───────────────────────── SECTION 4 ───────────────────────── -->
<h2>Content-Aware &amp; Domain-Weighting Approaches</h2>
<p class="detail">Closest to our <code>ClusterScorer</code> / Thompson sampling on clusters.</p>
<div class="paper">
<div class="paper-title">GRPO-LEAD: Difficulty-Aware RL + Length Regularization</div>
<div class="paper-link"><a href="https://arxiv.org/abs/2504.09696">arxiv.org/abs/2504.09696</a></div>
<div class="paper-tags">
<span class="tag content">content-aware</span>
<span class="tag venue">EMNLP 2025</span>
</div>
<p>Upweights advantages on harder problems in the <strong>loss computation</strong> (not sampling stage). Length-regularized rewards for conciseness.</p>
<p class="key-point">Our reward shaping (<code>think_close_penalty</code>, <code>length_penalty</code>) is similar to their length regularization. Their advantage reweighting is complementary &mdash; modify the loss, not the proposal distribution.</p>
</div>
<div class="paper">
<div class="paper-title">DoReMi: Optimizing Data Mixtures via Minimax</div>
<div class="paper-link"><a href="https://arxiv.org/abs/2305.10429">arxiv.org/abs/2305.10429</a></div>
<div class="paper-tags">
<span class="tag content">domain-weighting</span>
<span class="tag venue">NeurIPS 2023</span>
</div>
<p>Pretraining analog of our cluster Thompson sampling. Group DRO across domains: <strong>minimax over worst-case excess loss</strong>. Upweights domains where the model has the largest gap between current and reference model.</p>
<div class="callout warn">
<strong>Key insight for us:</strong> Might fix <code>cluster_level</code> underperformance. Thompson sampling on 5 buckets can't converge in 20 steps, but minimax would directly target the weakest cluster without needing to estimate variance.
</div>
</div>
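<p class="detail">The minimax step reduces to multiplicative weights on per-cluster excess loss; a sketch with illustrative hyperparameters:</p>

```python
import math

def doremi_step(weights, excess_losses, lr=1.0, smoothing=1e-3):
    """One DoReMi-style Group DRO update on cluster weights (sketch).

    Clusters with the largest excess loss (current minus reference model,
    clamped at 0) are exponentially upweighted, then weights are
    renormalized and mixed with uniform so none collapses to zero.
    """
    raw = [w * math.exp(lr * max(l, 0.0)) for w, l in zip(weights, excess_losses)]
    z = sum(raw)
    k = len(weights)
    return [(1 - smoothing) * r / z + smoothing / k for r in raw]

w = [0.2] * 5                        # 5 clusters, uniform start
excess = [0.0, 0.1, 0.8, 0.05, 0.0]  # cluster 2 lags its reference the most
w = doremi_step(w, excess)
print(max(range(5), key=lambda i: w[i]))  # 2 -- weakest cluster upweighted
```

<p class="detail">Note there is no variance estimate to converge here, which is what makes this attractive for 5-bucket <code>cluster_level</code> in short runs.</p>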
<div class="paper">
<div class="paper-title">VADE: Variance-Aware Dynamic Sampling (Beta + Thompson)</div>
<div class="paper-link"><a href="https://arxiv.org/abs/2511.18902">arxiv.org/abs/2511.18902</a></div>
<div class="paper-tags">
<span class="tag difficulty">difficulty</span>
<span class="tag content">content-aware</span>
</div>
<p>Online per-sample difficulty via Beta distributions + Thompson sampling + two-scale prior decay for non-stationarity. Plug-and-play for GRPO/GSPO.</p>
<p class="key-point">Most similar to our per-problem Thompson sampling. Their prior decay mechanism handles policy evolution &mdash; something our EMA alpha partially addresses.</p>
</div>
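<p class="detail">A minimal version of the Beta + Thompson loop, with a single decay rate standing in for VADE's two-scale prior decay (class names and constants are illustrative):</p>

```python
import random

class BetaThompsonSelector:
    """Per-problem Beta/Thompson sampling with prior decay (VADE-style sketch).

    Each problem keeps Beta(alpha, beta) counts of correct/incorrect rollouts.
    `decay` shrinks old counts on each update so estimates track the evolving
    policy (a simplification of VADE's two-scale decay).
    """
    def __init__(self, n_problems, decay=0.95):
        self.alpha = [1.0] * n_problems
        self.beta = [1.0] * n_problems
        self.decay = decay

    def update(self, i, n_correct, n_wrong):
        self.alpha[i] = self.decay * self.alpha[i] + n_correct
        self.beta[i] = self.decay * self.beta[i] + n_wrong

    def pick(self):
        # Sample a pass rate per problem; prefer the draw closest to 0.5
        # (the highest expected reward variance).
        draws = [random.betavariate(a, b) for a, b in zip(self.alpha, self.beta)]
        return min(range(len(draws)), key=lambda i: abs(draws[i] - 0.5))

sel = BetaThompsonSelector(n_problems=3)
sel.update(0, n_correct=8, n_wrong=0)  # easy: almost always solved
sel.update(1, n_correct=4, n_wrong=4)  # frontier
sel.update(2, n_correct=0, n_wrong=8)  # too hard right now
print(sel.pick())  # usually 1
```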
<!-- ───────────────────────── MAPPING TABLE ───────────────────────── -->
<h2>Mapping to Our Experiments</h2>
<table>
<thead>
<tr><th>Our Approach</th><th>Closest Paper</th><th>Key Difference</th></tr>
</thead>
<tbody>
<tr>
<td><code>DifficultyTracker</code> (EMA + UCB)</td>
<td>VCRL, DOTS</td>
<td>DOTS estimates difficulty <em>without</em> rolling out every prompt</td>
</tr>
<tr>
<td>Rejection sampling (zero-var filter)</td>
<td>DEPO</td>
<td>DEPO filters <em>before</em> rollout, saving vLLM compute</td>
</tr>
<tr>
<td><code>BufferStore</code> replay</td>
<td>DOTS replay buffer</td>
<td>Very similar; DOTS uses FIFO, we use age + replay count</td>
</tr>
<tr>
<td>Fixed <code>group_size=8</code></td>
<td>CurES, Reinforce-Ada</td>
<td><strong>Variable group sizes per prompt</strong> &mdash; more for hard</td>
</tr>
<tr>
<td><code>ClusterScorer</code> (Thompson)</td>
<td>DoReMi (minimax)</td>
<td>Worst-case optimization vs variance-proportional sampling</td>
</tr>
<tr>
<td><code>type_level</code> wins, <code>level</code> hurts</td>
<td>"Hard Examples" paper</td>
<td>Content structure &gt; difficulty alone for clustering</td>
</tr>
<tr>
<td>k=15 embed clusters &asymp; type_level</td>
<td><em>No direct analog</em></td>
<td>Optimal k matching metadata cluster count appears novel</td>
</tr>
</tbody>
</table>
<!-- ───────────────────────── ACTIONABLE ───────────────────────── -->
<h2>Most Actionable Ideas</h2>
<div class="action-item">
<div class="action-num">1</div>
<div class="action-body">
<strong>Variable group sizes</strong> &mdash; Instead of <code>group_size=8</code> for all prompts, allocate 12&ndash;16 rollouts for frontier problems and 4 for easy ones. Same total vLLM budget, better gradient signal. Orthogonal to cluster work.
<div class="source">CurES, Reinforce-Ada</div>
</div>
</div>
<div class="action-item">
<div class="action-num">2</div>
<div class="action-body">
<strong>Pre-rollout filtering</strong> &mdash; Train a cheap classifier on (problem_embedding, step) &rarr; predicted reward variance. Skip problems predicted all-correct or all-wrong <em>before</em> sending to vLLM. Reduces wasted <code>max_gen_rounds</code>.
<div class="source">DEPO</div>
</div>
</div>
<div class="action-item">
<div class="action-num">3</div>
<div class="action-body">
<strong>Minimax cluster reweighting</strong> &mdash; Instead of Thompson on cluster variance, upweight clusters with the largest gap between current and baseline performance. May fix <code>cluster_level</code> underperformance (5 buckets too few for Thompson, but minimax targets weakest directly).
<div class="source">DoReMi</div>
</div>
</div>
<div class="action-item">
<div class="action-num">4</div>
<div class="action-body">
<strong>Within-group distinctiveness masking</strong> &mdash; When all 8 outputs are near-identical (common at high accuracy), mask redundant pairs to prevent gradient conflicts. Cheap: just compare output similarity.
<div class="source">DaGRPO</div>
</div>
</div>
<div class="action-item">
<div class="action-num">5</div>
<div class="action-body">
<strong>Advantage reweighting by difficulty</strong> &mdash; Upweight advantages on harder problems in the loss, complementing proposal-side changes. Can combine with length regularization we already have.
<div class="source">GRPO-LEAD</div>
</div>
</div>
<p style="margin-top: 2rem; color: var(--text-muted); font-size: 0.8rem; text-align: center;">
Generated Feb 2026 &mdash; lora-without-regret project
</p>
</body>
</html>