Staff View: Scaling Reasoning Tokens via RL and Parallel Thinking: Evidence From Competitive Programming :: Library Catalog


Bibliographic Details
Main Authors:	Zhang, Qianfan; Guo, Tianyu; Ren, Xuandi; Chen, Jiale; Ding, Ming; Xin, Ran; Xiao, Xia
Format:	Preprint
Published:	2026
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2604.01302

_version_	1866912996404494336
author	Zhang, Qianfan; Guo, Tianyu; Ren, Xuandi; Chen, Jiale; Ding, Ming; Xin, Ran; Xiao, Xia
author_facet	Zhang, Qianfan; Guo, Tianyu; Ren, Xuandi; Chen, Jiale; Ding, Ming; Xin, Ran; Xiao, Xia
contents	We study how to scale reasoning token budgets for competitive programming through two complementary approaches: training-time reinforcement learning (RL) and test-time parallel thinking. During RL training, we observe an approximately log-linear relationship between validation accuracy and the average number of generated reasoning tokens over successive checkpoints, and show two ways to shift this training trajectory: verification RL warmup raises the starting point, while randomized clipping produces a steeper trend in the observed regime. As scaling single-generation reasoning during RL quickly becomes expensive under full attention, we introduce a multi-round parallel thinking pipeline that distributes the token budget across threads and rounds of generation, verification, and refinement. We train the model end-to-end on this pipeline to match the training objective to the test-time structure. Starting from Seed-OSS-36B, the full system with 16 threads and 16 rounds per thread matches the underlying RL model's oracle pass@16 at pass@1 using 7.6 million tokens per problem on average, and surpasses GPT-5-high on 456 hard competitive programming problems from AetherCode.
format	Preprint
id	arxiv_https___arxiv_org_abs_2604_01302
institution	arXiv
publishDate	2026
record_format	arxiv
title	Scaling Reasoning Tokens via RL and Parallel Thinking: Evidence From Competitive Programming
topic	Computation and Language
url	https://arxiv.org/abs/2604.01302

Similar Items