Bibliographic Details
Main Authors: Gu, Zhengyao, Light, Jonathan, Astudillo, Raul, Ye, Ziyu, He, Langzhou, Zou, Henry Peng, Cheng, Wei, Paternain, Santiago, Yu, Philip S., Yue, Yisong
Format: Preprint
Published: 2026
Subjects:
Online Access: https://arxiv.org/abs/2602.20532
contents Post-training large foundation models with reinforcement learning typically relies on massive and heterogeneous datasets, making effective curriculum learning both critical and challenging. In this work, we propose ACTOR-CURATOR, a scalable and fully automated curriculum learning framework for reinforcement learning post-training of large language models (LLMs). ACTOR-CURATOR learns a neural curator that dynamically selects training problems from large problem banks by directly optimizing for expected policy performance improvement. We formulate problem selection as a non-stationary stochastic bandit problem, derive a principled loss function based on online stochastic mirror descent, and establish regret guarantees under partial feedback. Empirically, ACTOR-CURATOR consistently outperforms uniform sampling and strong curriculum baselines across a wide range of challenging reasoning benchmarks, demonstrating improved training stability and efficiency. Notably, it achieves relative gains of 28.6% on AIME2024 and 30.5% on ARC-1D over the strongest baseline and up to 80% speedup. These results suggest that ACTOR-CURATOR is a powerful and practical approach for scalable LLM post-training.
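The abstract frames problem selection as a stochastic bandit solved with online stochastic mirror descent under partial feedback. As a rough illustration only, the sketch below uses the textbook EXP3 rule, which is what mirror descent on the probability simplex reduces to with a negative-entropy regularizer. It is not the paper's neural curator or its loss function. The names `sampling_probs`, `exp3_step`, and `run`, the parameters `eta` and `gamma`, and the per-problem `true_gain` values are all hypothetical and made up for the demo; the 0/1 reward merely stands in for an observed policy improvement.

```python
import math
import random


def sampling_probs(log_weights, gamma):
    """Softmax over log-weights, mixed with uniform exploration of rate gamma."""
    m = max(log_weights)  # subtract the max for numerical stability
    exp_w = [math.exp(lw - m) for lw in log_weights]
    total = sum(exp_w)
    k = len(log_weights)
    return [(1.0 - gamma) * w / total + gamma / k for w in exp_w]


def exp3_step(log_weights, arm, reward, probs, eta):
    """Partial feedback: only the selected arm's weight is updated."""
    # Dividing by the selection probability keeps the estimate unbiased.
    estimate = reward / probs[arm]
    updated = list(log_weights)
    updated[arm] += eta * estimate
    return updated


def run(true_gain=(0.1, 0.2, 0.8, 0.3), steps=2000, eta=0.05, gamma=0.1, seed=0):
    """Simulate selection over a toy problem bank (assumed gains, not real data)."""
    rng = random.Random(seed)
    k = len(true_gain)
    log_weights = [0.0] * k
    for _ in range(steps):
        probs = sampling_probs(log_weights, gamma)
        arm = rng.choices(range(k), weights=probs, k=1)[0]
        # Bernoulli stand-in for "training on this problem improved the policy".
        reward = 1.0 if rng.random() < true_gain[arm] else 0.0
        log_weights = exp3_step(log_weights, arm, reward, probs, eta)
    return sampling_probs(log_weights, gamma)


if __name__ == "__main__":
    print(run())
```

The uniform mixing term keeps every problem's selection probability at or above `gamma / k`, which bounds the importance-weighted estimates. This plain version does not address the non-stationarity the abstract mentions.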
institution arXiv
title Actor-Curator: Co-adaptive Curriculum Learning via Policy-Improvement Bandits for RL Post-Training
topic Machine Learning
Artificial Intelligence
Computation and Language
68T05, 68T20, 90C40, 62L05
I.2.6; I.2.8; I.2.1; F.1.1