Staff View: Actor-Curator: Co-adaptive Curriculum Learning via Policy-Improvement Bandits for RL Post-Training :: Library Catalog

Bibliographic Details
Main Authors:	Gu, Zhengyao; Light, Jonathan; Astudillo, Raul; Ye, Ziyu; He, Langzhou; Zou, Henry Peng; Cheng, Wei; Paternain, Santiago; Yu, Philip S.; Yue, Yisong
Format:	Preprint
Published:	2026
Subjects:	Machine Learning; Artificial Intelligence; Computation and Language; MSC: 68T05, 68T20, 90C40, 62L05; ACM: I.2.6, I.2.8, I.2.1, F.1.1
Online Access:	https://arxiv.org/abs/2602.20532

_version_	1866908849864179712
author	Gu, Zhengyao; Light, Jonathan; Astudillo, Raul; Ye, Ziyu; He, Langzhou; Zou, Henry Peng; Cheng, Wei; Paternain, Santiago; Yu, Philip S.; Yue, Yisong
author_facet	Gu, Zhengyao; Light, Jonathan; Astudillo, Raul; Ye, Ziyu; He, Langzhou; Zou, Henry Peng; Cheng, Wei; Paternain, Santiago; Yu, Philip S.; Yue, Yisong
contents	Post-training large foundation models with reinforcement learning typically relies on massive and heterogeneous datasets, making effective curriculum learning both critical and challenging. In this work, we propose ACTOR-CURATOR, a scalable and fully automated curriculum learning framework for reinforcement learning post-training of large language models (LLMs). ACTOR-CURATOR learns a neural curator that dynamically selects training problems from large problem banks by directly optimizing for expected policy performance improvement. We formulate problem selection as a non-stationary stochastic bandit problem, derive a principled loss function based on online stochastic mirror descent, and establish regret guarantees under partial feedback. Empirically, ACTOR-CURATOR consistently outperforms uniform sampling and strong curriculum baselines across a wide range of challenging reasoning benchmarks, demonstrating improved training stability and efficiency. Notably, it achieves relative gains of 28.6% on AIME2024 and 30.5% on ARC-1D over the strongest baseline and up to 80% speedup. These results suggest that ACTOR-CURATOR is a powerful and practical approach for scalable LLM post-training.
format	Preprint
id	arxiv_https___arxiv_org_abs_2602_20532
institution	arXiv
publishDate	2026
record_format	arxiv
title	Actor-Curator: Co-adaptive Curriculum Learning via Policy-Improvement Bandits for RL Post-Training
topic	Machine Learning; Artificial Intelligence; Computation and Language; MSC: 68T05, 68T20, 90C40, 62L05; ACM: I.2.6, I.2.8, I.2.1, F.1.1
url	https://arxiv.org/abs/2602.20532

Similar Items