Bibliographic Details
Main Authors: Fan, Shengda, Ye, Xuyan, Huo, Yupeng, Chen, Zhi-Yuan, Guo, Yiju, Yang, Shenzhi, Yang, Wenkai, Ye, Shuqi, Chen, Jingwen, Chen, Haotian, Cong, Xin, Lin, Yankai
Format: Preprint
Published: 2026
Subjects: Artificial Intelligence
Online Access: https://arxiv.org/abs/2603.14465
author Fan, Shengda
Ye, Xuyan
Huo, Yupeng
Chen, Zhi-Yuan
Guo, Yiju
Yang, Shenzhi
Yang, Wenkai
Ye, Shuqi
Chen, Jingwen
Chen, Haotian
Cong, Xin
Lin, Yankai
contents While Large Language Models (LLMs) have evolved into tool-using agents, they remain brittle in long-horizon interactions. Unlike mathematical reasoning where errors are often rectifiable via backtracking, tool-use failures frequently induce irreversible side effects, making accurate step-level verification critical. However, existing process-level benchmarks are predominantly confined to closed-world mathematical domains, failing to capture the dynamic and open-ended nature of tool execution. To bridge this gap, we introduce AgentProcessBench, the first benchmark dedicated to evaluating step-level effectiveness in realistic, tool-augmented trajectories. The benchmark comprises 1,000 diverse trajectories and 8,509 human-labeled step annotations with 89.1% inter-annotator agreement. It features a ternary labeling scheme to capture exploration and an error propagation rule to reduce labeling ambiguity. Extensive experiments reveal key insights: (1) weaker policy models exhibit inflated ratios of correct steps due to early termination; (2) distinguishing neutral and erroneous actions remains a significant challenge for current models; and (3) process-derived signals provide complementary value to outcome supervision, significantly enhancing test-time scaling. We hope AgentProcessBench can foster future research in reward models and pave the way toward general agents. The code and data are available at https://github.com/RUCBM/AgentProcessBench.
format Preprint
id arxiv_https___arxiv_org_abs_2603_14465
institution arXiv
publishDate 2026
record_format arxiv
title AgentProcessBench: Diagnosing Step-Level Process Quality in Tool-Using Agents
topic Artificial Intelligence
url https://arxiv.org/abs/2603.14465