Bibliographic Details
Main Authors: Cheng, Zhoujun; Xie, Yutao; Qu, Yuxiao; Setlur, Amrith; Hao, Shibo; Pimpalkhute, Varad; Liang, Tongtong; Yao, Feng; Liu, Zhengzhong; Xing, Eric; Smith, Virginia; Salakhutdinov, Ruslan; Hu, Zhiting; Killian, Taylor; Kumar, Aviral
Format: Preprint
Published: 2026
Subjects: Machine Learning; Artificial Intelligence
Online Access: https://arxiv.org/abs/2603.12151
author Cheng, Zhoujun
Xie, Yutao
Qu, Yuxiao
Setlur, Amrith
Hao, Shibo
Pimpalkhute, Varad
Liang, Tongtong
Yao, Feng
Liu, Zhengzhong
Xing, Eric
Smith, Virginia
Salakhutdinov, Ruslan
Hu, Zhiting
Killian, Taylor
Kumar, Aviral
contents While scaling laws guide compute allocation for LLM pre-training, analogous prescriptions for reinforcement learning (RL) post-training of large language models (LLMs) remain poorly understood. We study the compute-optimal allocation of sampling compute for on-policy RL methods in LLMs, framing scaling as a compute-constrained optimization over three resources: parallel rollouts per problem, number of problems per batch, and number of update steps. We find that the compute-optimal number of parallel rollouts per problem increases predictably with compute budget and then saturates. This trend holds across both easy and hard problems, though driven by different mechanisms: solution sharpening on easy problems and coverage expansion on hard problems. We further show that increasing the number of parallel rollouts mitigates interference across problems, while the number of problems per batch primarily affects training stability and can be chosen within a broad range. Validated across base models and data distributions, our results recast RL scaling laws as prescriptive allocation rules and provide practical guidance for compute-efficient LLM RL post-training.
format Preprint
id arxiv_https___arxiv_org_abs_2603_12151
institution arXiv
publishDate 2026
record_format arxiv
title IsoCompute Playbook: Optimally Scaling Sampling Compute for LLM RL
topic Machine Learning
Artificial Intelligence
url https://arxiv.org/abs/2603.12151