- Ridge acceptance rate (late stage): 57% vs baseline 42%. The adaptive scorer IS better at finding useful problems: 22% fewer wasted generations (589 vs 758 attempts for 320 useful groups).
- The per-level breakdown shows the baseline winning at every level, but the adaptive scorer's run was handicapped by a KL explosion at step 19 (kl=10.2, grad_norm=4.66), so the per-level comparison is not a clean one.
- Eval accuracy: baseline 82.2% > ridge 78.7% > cluster 77.6%. Better exploration didn't translate to better training.
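
A quick sanity check of the attempt counts in the first bullet, as a minimal sketch. It assumes acceptance rate means useful groups divided by generation attempts, and the function and constant names are made up for illustration. The whole-run ridge rate it computes (about 54%) is not expected to match the 57% late-stage figure, since that one covers a different window.

```python
# Assumption: acceptance rate = useful groups / generation attempts.
# All names here are illustrative, not taken from the training code.

def acceptance_rate(useful: int, attempts: int) -> float:
    """Fraction of generation attempts that produced a useful group."""
    return useful / attempts


def relative_reduction(baseline: int, new: int) -> float:
    """Fractional drop in attempts relative to the baseline."""
    return (baseline - new) / baseline


USEFUL_GROUPS = 320       # useful groups collected by both runs
RIDGE_ATTEMPTS = 589      # attempts the ridge (adaptive) scorer needed
BASELINE_ATTEMPTS = 758   # attempts the baseline needed

ridge_rate = acceptance_rate(USEFUL_GROUPS, RIDGE_ATTEMPTS)
baseline_rate = acceptance_rate(USEFUL_GROUPS, BASELINE_ATTEMPTS)
saved = relative_reduction(BASELINE_ATTEMPTS, RIDGE_ATTEMPTS)

print(f"ridge acceptance (whole run):    {ridge_rate:.1%}")     # 54.3%
print(f"baseline acceptance (whole run): {baseline_rate:.1%}")  # 42.2%
print(f"fewer attempts with ridge:       {saved:.1%}")          # 22.3%
```

The 22.3% reduction matches the "22% fewer" figure above, and the baseline's whole-run rate matches the quoted 42%.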