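| Main Authors: | Gu, Zhengyao; Light, Jonathan; Astudillo, Raul; Ye, Ziyu; He, Langzhou; Zou, Henry Peng; Cheng, Wei; Paternain, Santiago; Yu, Philip S.; Yue, Yisong |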
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | Machine Learning; Artificial Intelligence; Computation and Language |
| Online Access: | https://arxiv.org/abs/2602.20532 |
| _version_ | 1866908849864179712 |
|---|---|
| author | Gu, Zhengyao; Light, Jonathan; Astudillo, Raul; Ye, Ziyu; He, Langzhou; Zou, Henry Peng; Cheng, Wei; Paternain, Santiago; Yu, Philip S.; Yue, Yisong |
| author_facet | Gu, Zhengyao; Light, Jonathan; Astudillo, Raul; Ye, Ziyu; He, Langzhou; Zou, Henry Peng; Cheng, Wei; Paternain, Santiago; Yu, Philip S.; Yue, Yisong |
| contents | Post-training large foundation models with reinforcement learning typically relies on massive and heterogeneous datasets, making effective curriculum learning both critical and challenging. In this work, we propose ACTOR-CURATOR, a scalable and fully automated curriculum learning framework for reinforcement learning post-training of large language models (LLMs). ACTOR-CURATOR learns a neural curator that dynamically selects training problems from large problem banks by directly optimizing for expected policy performance improvement. We formulate problem selection as a non-stationary stochastic bandit problem, derive a principled loss function based on online stochastic mirror descent, and establish regret guarantees under partial feedback. Empirically, ACTOR-CURATOR consistently outperforms uniform sampling and strong curriculum baselines across a wide range of challenging reasoning benchmarks, demonstrating improved training stability and efficiency. Notably, it achieves relative gains of 28.6% on AIME2024 and 30.5% on ARC-1D over the strongest baseline and up to 80% speedup. These results suggest that ACTOR-CURATOR is a powerful and practical approach for scalable LLM post-training. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2602_20532 |
| institution | arXiv |
| publishDate | 2026 |
| record_format | arxiv |
| title | Actor-Curator: Co-adaptive Curriculum Learning via Policy-Improvement Bandits for RL Post-Training |
| topic | Machine Learning; Artificial Intelligence; Computation and Language; MSC: 68T05, 68T20, 90C40, 62L05; ACM: I.2.6; I.2.8; I.2.1; F.1.1 |
| url | https://arxiv.org/abs/2602.20532 |