:: Library Catalog

Bibliographic Details
Main Authors:	Lei, Fei; Yang, Yibo; Sun, Wenxiu; Lin, Dahua
Format:	Preprint
Published:	2025
Subjects:	Computation and Language; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2508.16260

Similar Items

ToolRM: Towards Agentic Tool-Use Reward Modeling
by: Li, Renhao, et al.
Published: (2025)

ABC-Bench: Benchmarking Agentic Backend Coding in Real-World Development
by: Yang, Jie, et al.
Published: (2026)

FinMCP-Bench: Benchmarking LLM Agents for Real-World Financial Tool Use under the Model Context Protocol
by: Zhu, Jie, et al.
Published: (2026)

FamilyTool: A Multi-hop Personalized Tool Use Benchmark
by: Wang, Yuxin, et al.
Published: (2025)

Benchmarking LLM Tool-Use in the Wild
by: Yu, Peijie, et al.
Published: (2026)

$τ$-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains
by: Yao, Shunyu, et al.
Published: (2024)

REALM: A Dataset of Real-World LLM Use Cases
by: Cheng, Jingwen, et al.
Published: (2025)

VerlTool: Towards Holistic Agentic Reinforcement Learning with Tool Use
by: Jiang, Dongfu, et al.
Published: (2025)

TOUCAN: Synthesizing 1.5M Tool-Agentic Data from Real-World MCP Environments
by: Xu, Zhangchen, et al.
Published: (2025)

LongWeave: A Long-Form Generation Benchmark Bridging Real-World Relevance and Verifiability
by: Xiao, Zikai, et al.
Published: (2025)

RealMem: Benchmarking LLMs in Real-World Memory-Driven Interaction
by: Bian, Haonan, et al.
Published: (2026)

Timely Machine: Awareness of Time Makes Test-Time Scaling Agentic
by: Ma, Yichuan, et al.
Published: (2026)

MCP-SafetyBench: A Benchmark for Safety Evaluation of Large Language Models with Real-World MCP Servers
by: Zong, Xuanjun, et al.
Published: (2025)

SLM-Based Agentic AI with P-C-G: Optimized for Korean Tool Use
by: Jeon, Changhyun, et al.
Published: (2025)

CCTU: A Benchmark for Tool Use under Complex Constraints
by: Ye, Junjie, et al.
Published: (2026)

D-GARA: A Dynamic Benchmarking Framework for GUI Agent Robustness in Real-World Anomalies
by: Chen, Sen, et al.
Published: (2025)

StableToolBench-MirrorAPI: Modeling Tool Environments as Mirrors of 7,000+ Real-World APIs
by: Guo, Zhicheng, et al.
Published: (2025)

Agentic Reinforcement Learning for Real-World Code Repair
by: Zhu, Siyu, et al.
Published: (2025)

DetectRL: Benchmarking LLM-Generated Text Detection in Real-World Scenarios
by: Wu, Junchao, et al.
Published: (2024)

GTA-2: Benchmarking General Tool Agents from Atomic Tool-Use to Open-Ended Workflows
by: Wang, Jize, et al.
Published: (2026)

Flames: Benchmarking Value Alignment of LLMs in Chinese
by: Huang, Kexin, et al.
Published: (2023)

Balanced Data Sampling for Language Model Training with Clustering
by: Shao, Yunfan, et al.
Published: (2024)

Adaptive Monitoring and Real-World Evaluation of Agentic AI Systems
by: Shukla, Manish
Published: (2025)

RealFactBench: A Benchmark for Evaluating Large Language Models in Real-World Fact-Checking
by: Yang, Shuo, et al.
Published: (2025)

Consultant Decoding: Yet Another Synergistic Mechanism
by: Ding, Chuanghao, et al.
Published: (2025)

In-the-Flow Agentic System Optimization for Effective Planning and Tool Use
by: Li, Zhuofeng, et al.
Published: (2025)

DeepPlanning: Benchmarking Long-Horizon Agentic Planning with Verifiable Constraints
by: Zhang, Yinger, et al.
Published: (2026)

Agentic Reasoning: A Streamlined Framework for Enhancing LLM Reasoning with Agentic Tools
by: Wu, Junde, et al.
Published: (2025)

Frontier-Eng: Benchmarking Self-Evolving Agents on Real-World Engineering Tasks with Generative Optimization
by: Chi, Yizhe, et al.
Published: (2026)

RuleArena: A Benchmark for Rule-Guided Reasoning with LLMs in Real-World Scenarios
by: Zhou, Ruiwen, et al.
Published: (2024)

Make-it-Real: Unleashing Large Multimodal Model for Painting 3D Objects with Realistic Materials
by: Fang, Ye, et al.
Published: (2024)

ToolBeHonest: A Multi-level Hallucination Diagnostic Benchmark for Tool-Augmented Large Language Models
by: Zhang, Yuxiang, et al.
Published: (2024)

AGENTIF: Benchmarking Instruction Following of Large Language Models in Agentic Scenarios
by: Qi, Yunjia, et al.
Published: (2025)

PyVision: Agentic Vision with Dynamic Tooling
by: Zhao, Shitian, et al.
Published: (2025)

Deep Video Discovery: Agentic Search with Tool Use for Long-form Video Understanding
by: Zhang, Xiaoyi, et al.
Published: (2025)

ToolSandbox: A Stateful, Conversational, Interactive Evaluation Benchmark for LLM Tool Use Capabilities
by: Lu, Jiarui, et al.
Published: (2024)

Can a Single Model Master Both Multi-turn Conversations and Tool Use? CoALM: A Unified Conversational Agentic Language Model
by: Acikgoz, Emre Can, et al.
Published: (2025)

Large Language Models Can Solve Real-World Planning Rigorously with Formal Verification Tools
by: Hao, Yilun, et al.
Published: (2024)

PLawBench: A Rubric-Based Benchmark for Evaluating LLMs in Real-World Legal Practice
by: Shi, Yuzhen, et al.
Published: (2026)

Multi-Dimensional Insights: Benchmarking Real-World Personalization in Large Multimodal Models
by: Zhang, YiFan, et al.
Published: (2024)