Staff View: Revisiting Reinforcement Learning for LLM Reasoning from A Cross-Domain Perspective :: Library Catalog

Bibliographic Details
Main Authors:	Cheng, Zhoujun; Hao, Shibo; Liu, Tianyang; Zhou, Fan; Xie, Yutao; Yao, Feng; Bian, Yuexin; Zhuang, Yonghao; Dey, Nilabjo; Zha, Yuheng; Gu, Yi; Zhou, Kun; Wang, Yuqi; Li, Yuan; Fan, Richard; She, Jianshu; Gao, Chengqian; Saparov, Abulhair; Li, Haonan; Killian, Taylor W.; Yurochkin, Mikhail; Liu, Zhengzhong; Xing, Eric P.; Hu, Zhiting
Format:	Preprint
Published:	2025
Subjects:	Machine Learning; Artificial Intelligence; Computation and Language
Online Access:	https://arxiv.org/abs/2506.14965

_version_	1866908411429388288
author	Cheng, Zhoujun; Hao, Shibo; Liu, Tianyang; Zhou, Fan; Xie, Yutao; Yao, Feng; Bian, Yuexin; Zhuang, Yonghao; Dey, Nilabjo; Zha, Yuheng; Gu, Yi; Zhou, Kun; Wang, Yuqi; Li, Yuan; Fan, Richard; She, Jianshu; Gao, Chengqian; Saparov, Abulhair; Li, Haonan; Killian, Taylor W.; Yurochkin, Mikhail; Liu, Zhengzhong; Xing, Eric P.; Hu, Zhiting
author_facet	Cheng, Zhoujun; Hao, Shibo; Liu, Tianyang; Zhou, Fan; Xie, Yutao; Yao, Feng; Bian, Yuexin; Zhuang, Yonghao; Dey, Nilabjo; Zha, Yuheng; Gu, Yi; Zhou, Kun; Wang, Yuqi; Li, Yuan; Fan, Richard; She, Jianshu; Gao, Chengqian; Saparov, Abulhair; Li, Haonan; Killian, Taylor W.; Yurochkin, Mikhail; Liu, Zhengzhong; Xing, Eric P.; Hu, Zhiting
contents	Reinforcement learning (RL) has emerged as a promising approach to improve large language model (LLM) reasoning, yet most open efforts focus narrowly on math and code, limiting our understanding of its broader applicability to general reasoning. A key challenge lies in the lack of reliable, scalable RL reward signals across diverse reasoning domains. We introduce Guru, a curated RL reasoning corpus of 92K verifiable examples spanning six reasoning domains--Math, Code, Science, Logic, Simulation, and Tabular--each built through domain-specific reward design, deduplication, and filtering to ensure reliability and effectiveness for RL training. Based on Guru, we systematically revisit established findings in RL for LLM reasoning and observe significant variation across domains. For example, while prior work suggests that RL primarily elicits existing knowledge from pretrained models, our results reveal a more nuanced pattern: domains frequently seen during pretraining (Math, Code, Science) easily benefit from cross-domain RL training, while domains with limited pretraining exposure (Logic, Simulation, and Tabular) require in-domain training to achieve meaningful performance gains, suggesting that RL is likely to facilitate genuine skill acquisition. Finally, we present Guru-7B and Guru-32B, two models that achieve state-of-the-art performance among open models RL-trained with publicly available data, outperforming best baselines by 7.9% and 6.7% on our 17-task evaluation suite across six reasoning domains. We also show that our models effectively improve the Pass@k performance of their base models, particularly on complex tasks less likely to appear in pretraining data. We release data, models, training and evaluation code to facilitate general-purpose reasoning at: https://github.com/LLM360/Reasoning360
format	Preprint
id	arxiv_https___arxiv_org_abs_2506_14965
institution	arXiv
publishDate	2025
record_format	arxiv
title	Revisiting Reinforcement Learning for LLM Reasoning from A Cross-Domain Perspective
topic	Machine Learning; Artificial Intelligence; Computation and Language
url	https://arxiv.org/abs/2506.14965