| Main Authors: | Lei, Fei, Yang, Yibo, Sun, Wenxiu, Lin, Dahua |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Online Access: | https://arxiv.org/abs/2508.16260 |
Similar Items
ToolRM: Towards Agentic Tool-Use Reward Modeling
by: Li, Renhao, et al.
Published: (2025)
ABC-Bench: Benchmarking Agentic Backend Coding in Real-World Development
by: Yang, Jie, et al.
Published: (2026)
FinMCP-Bench: Benchmarking LLM Agents for Real-World Financial Tool Use under the Model Context Protocol
by: Zhu, Jie, et al.
Published: (2026)
FamilyTool: A Multi-hop Personalized Tool Use Benchmark
by: Wang, Yuxin, et al.
Published: (2025)
Benchmarking LLM Tool-Use in the Wild
by: Yu, Peijie, et al.
Published: (2026)
$τ$-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains
by: Yao, Shunyu, et al.
Published: (2024)
REALM: A Dataset of Real-World LLM Use Cases
by: Cheng, Jingwen, et al.
Published: (2025)
VerlTool: Towards Holistic Agentic Reinforcement Learning with Tool Use
by: Jiang, Dongfu, et al.
Published: (2025)
TOUCAN: Synthesizing 1.5M Tool-Agentic Data from Real-World MCP Environments
by: Xu, Zhangchen, et al.
Published: (2025)
LongWeave: A Long-Form Generation Benchmark Bridging Real-World Relevance and Verifiability
by: Xiao, Zikai, et al.
Published: (2025)
RealMem: Benchmarking LLMs in Real-World Memory-Driven Interaction
by: Bian, Haonan, et al.
Published: (2026)
Timely Machine: Awareness of Time Makes Test-Time Scaling Agentic
by: Ma, Yichuan, et al.
Published: (2026)
MCP-SafetyBench: A Benchmark for Safety Evaluation of Large Language Models with Real-World MCP Servers
by: Zong, Xuanjun, et al.
Published: (2025)
SLM-Based Agentic AI with P-C-G: Optimized for Korean Tool Use
by: Jeon, Changhyun, et al.
Published: (2025)
CCTU: A Benchmark for Tool Use under Complex Constraints
by: Ye, Junjie, et al.
Published: (2026)
D-GARA: A Dynamic Benchmarking Framework for GUI Agent Robustness in Real-World Anomalies
by: Chen, Sen, et al.
Published: (2025)
StableToolBench-MirrorAPI: Modeling Tool Environments as Mirrors of 7,000+ Real-World APIs
by: Guo, Zhicheng, et al.
Published: (2025)
Agentic Reinforcement Learning for Real-World Code Repair
by: Zhu, Siyu, et al.
Published: (2025)
DetectRL: Benchmarking LLM-Generated Text Detection in Real-World Scenarios
by: Wu, Junchao, et al.
Published: (2024)
GTA-2: Benchmarking General Tool Agents from Atomic Tool-Use to Open-Ended Workflows
by: Wang, Jize, et al.
Published: (2026)
Flames: Benchmarking Value Alignment of LLMs in Chinese
by: Huang, Kexin, et al.
Published: (2023)
Balanced Data Sampling for Language Model Training with Clustering
by: Shao, Yunfan, et al.
Published: (2024)
Adaptive Monitoring and Real-World Evaluation of Agentic AI Systems
by: Shukla, Manish
Published: (2025)
RealFactBench: A Benchmark for Evaluating Large Language Models in Real-World Fact-Checking
by: Yang, Shuo, et al.
Published: (2025)
Consultant Decoding: Yet Another Synergistic Mechanism
by: Ding, Chuanghao, et al.
Published: (2025)
In-the-Flow Agentic System Optimization for Effective Planning and Tool Use
by: Li, Zhuofeng, et al.
Published: (2025)
DeepPlanning: Benchmarking Long-Horizon Agentic Planning with Verifiable Constraints
by: Zhang, Yinger, et al.
Published: (2026)
Agentic Reasoning: A Streamlined Framework for Enhancing LLM Reasoning with Agentic Tools
by: Wu, Junde, et al.
Published: (2025)
Frontier-Eng: Benchmarking Self-Evolving Agents on Real-World Engineering Tasks with Generative Optimization
by: Chi, Yizhe, et al.
Published: (2026)
RuleArena: A Benchmark for Rule-Guided Reasoning with LLMs in Real-World Scenarios
by: Zhou, Ruiwen, et al.
Published: (2024)
Make-it-Real: Unleashing Large Multimodal Model for Painting 3D Objects with Realistic Materials
by: Fang, Ye, et al.
Published: (2024)
ToolBeHonest: A Multi-level Hallucination Diagnostic Benchmark for Tool-Augmented Large Language Models
by: Zhang, Yuxiang, et al.
Published: (2024)
AGENTIF: Benchmarking Instruction Following of Large Language Models in Agentic Scenarios
by: Qi, Yunjia, et al.
Published: (2025)
PyVision: Agentic Vision with Dynamic Tooling
by: Zhao, Shitian, et al.
Published: (2025)
Deep Video Discovery: Agentic Search with Tool Use for Long-form Video Understanding
by: Zhang, Xiaoyi, et al.
Published: (2025)
ToolSandbox: A Stateful, Conversational, Interactive Evaluation Benchmark for LLM Tool Use Capabilities
by: Lu, Jiarui, et al.
Published: (2024)
Can a Single Model Master Both Multi-turn Conversations and Tool Use? CoALM: A Unified Conversational Agentic Language Model
by: Acikgoz, Emre Can, et al.
Published: (2025)
Large Language Models Can Solve Real-World Planning Rigorously with Formal Verification Tools
by: Hao, Yilun, et al.
Published: (2024)
PLawBench: A Rubric-Based Benchmark for Evaluating LLMs in Real-World Legal Practice
by: Shi, Yuzhen, et al.
Published: (2026)
Multi-Dimensional Insights: Benchmarking Real-World Personalization in Large Multimodal Models
by: Zhang, YiFan, et al.
Published: (2024)