Bibliographic Details
Main Authors: Cheng, Zhoujun, Hao, Shibo, Liu, Tianyang, Zhou, Fan, Xie, Yutao, Yao, Feng, Bian, Yuexin, Zhuang, Yonghao, Dey, Nilabjo, Zha, Yuheng, Gu, Yi, Zhou, Kun, Wang, Yuqi, Li, Yuan, Fan, Richard, She, Jianshu, Gao, Chengqian, Saparov, Abulhair, Li, Haonan, Killian, Taylor W., Yurochkin, Mikhail, Liu, Zhengzhong, Xing, Eric P., Hu, Zhiting
Format: Preprint
Published: 2025
Subjects: Machine Learning, Artificial Intelligence, Computation and Language
Online Access: https://arxiv.org/abs/2506.14965
author Cheng, Zhoujun
Hao, Shibo
Liu, Tianyang
Zhou, Fan
Xie, Yutao
Yao, Feng
Bian, Yuexin
Zhuang, Yonghao
Dey, Nilabjo
Zha, Yuheng
Gu, Yi
Zhou, Kun
Wang, Yuqi
Li, Yuan
Fan, Richard
She, Jianshu
Gao, Chengqian
Saparov, Abulhair
Li, Haonan
Killian, Taylor W.
Yurochkin, Mikhail
Liu, Zhengzhong
Xing, Eric P.
Hu, Zhiting
contents Reinforcement learning (RL) has emerged as a promising approach to improve large language model (LLM) reasoning, yet most open efforts focus narrowly on math and code, limiting our understanding of its broader applicability to general reasoning. A key challenge lies in the lack of reliable, scalable RL reward signals across diverse reasoning domains. We introduce Guru, a curated RL reasoning corpus of 92K verifiable examples spanning six reasoning domains--Math, Code, Science, Logic, Simulation, and Tabular--each built through domain-specific reward design, deduplication, and filtering to ensure reliability and effectiveness for RL training. Based on Guru, we systematically revisit established findings in RL for LLM reasoning and observe significant variation across domains. For example, while prior work suggests that RL primarily elicits existing knowledge from pretrained models, our results reveal a more nuanced pattern: domains frequently seen during pretraining (Math, Code, Science) easily benefit from cross-domain RL training, while domains with limited pretraining exposure (Logic, Simulation, and Tabular) require in-domain training to achieve meaningful performance gains, suggesting that RL is likely to facilitate genuine skill acquisition. Finally, we present Guru-7B and Guru-32B, two models that achieve state-of-the-art performance among open models RL-trained with publicly available data, outperforming best baselines by 7.9% and 6.7% on our 17-task evaluation suite across six reasoning domains. We also show that our models effectively improve the Pass@k performance of their base models, particularly on complex tasks less likely to appear in pretraining data. We release data, models, training and evaluation code to facilitate general-purpose reasoning at: https://github.com/LLM360/Reasoning360
format Preprint
id arxiv_https___arxiv_org_abs_2506_14965
institution arXiv
publishDate 2025
record_format arxiv
title Revisiting Reinforcement Learning for LLM Reasoning from A Cross-Domain Perspective
topic Machine Learning
Artificial Intelligence
Computation and Language
url https://arxiv.org/abs/2506.14965