Staff View :: Library Catalog

Bibliographic Details
Main Authors:	Mai, Xinji, Xu, Haotian, Li, Zhong-Zhi, W, Xing, Wang, Weinong, Hu, Jian, Zhang, Yingying, Zhang, Wenqiang
Format:	Preprint
Published:	2025
Subjects:	Artificial Intelligence
Online Access:	https://arxiv.org/abs/2505.07773

_version_	1866916909661814784
author	Mai, Xinji; Xu, Haotian; Li, Zhong-Zhi; W, Xing; Wang, Weinong; Hu, Jian; Zhang, Yingying; Zhang, Wenqiang
author_facet	Mai, Xinji; Xu, Haotian; Li, Zhong-Zhi; W, Xing; Wang, Weinong; Hu, Jian; Zhang, Yingying; Zhang, Wenqiang
contents	Large Language Models (LLMs) often struggle with mathematical reasoning tasks that require precise, verifiable computation. While Reinforcement Learning (RL) from outcome-based rewards enhances text-based reasoning, understanding how agents autonomously learn to leverage external tools such as code execution remains crucial. We investigate RL from outcome-based rewards for Tool-Integrated Reasoning (ZeroTIR), training base LLMs to spontaneously generate and execute Python code for mathematical problems without supervised tool-use examples. Our central contribution is demonstrating that, as RL training progresses, key metrics scale predictably. Specifically, we observe strong positive correlations between the number of training steps and three quantities: the frequency of spontaneous code execution, the average response length, and, critically, the final task accuracy. This suggests a quantifiable relationship between the computational effort invested in training and the emergence of effective, tool-augmented reasoning strategies. We implement a robust framework featuring a decoupled code execution environment and validate our findings across standard RL algorithms and frameworks. Experiments show that ZeroTIR significantly surpasses non-tool ZeroRL baselines on challenging math benchmarks. Our findings provide a foundational understanding of how autonomous tool use is acquired and scales within Agent RL, offering a reproducible benchmark for future studies. Code is released at https://github.com/yyht/openrlhf_async_pipline.
format	Preprint
id	arxiv_https___arxiv_org_abs_2505_07773
institution	arXiv
publishDate	2025
record_format	arxiv
title	Agent RL Scaling Law: Agent RL with Spontaneous Code Execution for Mathematical Problem Solving
topic	Artificial Intelligence
url	https://arxiv.org/abs/2505.07773

Similar Items