:: Library Catalog

Bibliographic Details
Main Authors:	Guo, Zikang; Xu, Benfeng; Zhu, Chiwei; Hong, Wentao; Wang, Xiaorui; Mao, Zhendong
Format:	Preprint
Published:	2025
Subjects:	Computation and Language; Artificial Intelligence; Machine Learning
Online Access:	https://arxiv.org/abs/2509.09734

Similar Items

Automated Creativity Evaluation for Large Language Models: A Reference-Based Approach
by: Li, Ruizhe, et al.
Published: (2025)

DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents
by: Du, Mingxuan, et al.
Published: (2025)

MIRROR: Multi-agent Intra- and Inter-Reflection for Optimized Reasoning in Tool Learning
by: Guo, Zikang, et al.
Published: (2025)

AgentBench: Evaluating LLMs as Agents
by: Liu, Xiao, et al.
Published: (2023)

From Real to Synthetic: Synthesizing Millions of Diversified and Complicated User Instructions with Attributed Grounding
by: Zhu, Chiwei, et al.
Published: (2025)

DeepResearch Bench II: Diagnosing Deep Research Agents via Rubrics from Expert Report
by: Li, Ruizhe, et al.
Published: (2026)

MCP-SafetyBench: A Benchmark for Safety Evaluation of Large Language Models with Real-World MCP Servers
by: Zong, Xuanjun, et al.
Published: (2025)

FinMCP-Bench: Benchmarking LLM Agents for Real-World Financial Tool Use under the Model Context Protocol
by: Zhu, Jie, et al.
Published: (2026)

MCP-Flow: Facilitating LLM Agents to Master Real-World, Diverse and Scaling MCP Tools
by: Wang, Wenhao, et al.
Published: (2025)

LiveMCPBench: Can Agents Navigate an Ocean of MCP Tools?
by: Mo, Guozhao, et al.
Published: (2025)

FHIR-AgentBench: Benchmarking LLM Agents for Realistic Interoperable EHR Question Answering
by: Lee, Gyubok, et al.
Published: (2025)

LiveMCP-101: Stress Testing and Diagnosing MCP-enabled Agents on Challenging Queries
by: Yin, Ming, et al.
Published: (2025)

FS-Researcher: Test-Time Scaling for Long-Horizon Research Tasks with File-System-Based Agents
by: Zhu, Chiwei, et al.
Published: (2026)

MCPAgentBench: A Real-world Task Benchmark for Evaluating LLM Agent MCP Tool Use
by: Liu, Wenrui, et al.
Published: (2025)

MCP-Bench: Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers
by: Wang, Zhenting, et al.
Published: (2025)

MCPEval: Automatic MCP-based Deep Evaluation for AI Agent Models
by: Liu, Zhiwei, et al.
Published: (2025)

Wiki Live Challenge: Challenging Deep Research Agents with Expert-Level Wikipedia Articles
by: Wang, Shaohan, et al.
Published: (2026)

TOUCAN: Synthesizing 1.5M Tool-Agentic Data from Real-World MCP Environments
by: Xu, Zhangchen, et al.
Published: (2025)

MCP-Universe: Benchmarking Large Language Models with Real-World Model Context Protocol Servers
by: Luo, Ziyang, et al.
Published: (2025)

A-RAG: Scaling Agentic Retrieval-Augmented Generation via Hierarchical Retrieval Interfaces
by: Du, Mingxuan, et al.
Published: (2026)

MCP-Cosmos: World Model-Augmented Agents for Complex Task Execution in MCP Environments
by: Ganapavarapu, Giridhar, et al.
Published: (2026)

Network and Systems Performance Characterization of MCP-Enabled LLM Agents
by: Ding, Zihao, et al.
Published: (2025)

WildGraphBench: Benchmarking GraphRAG with Wild-Source Corpora
by: Wang, Pengyu, et al.
Published: (2026)

Training LLM-Based Agents with Synthetic Self-Reflected Trajectories and Partial Masking
by: Chen, Yihan, et al.
Published: (2025)

ExpertPrompting: Instructing Large Language Models to be Distinguished Experts
by: Xu, Benfeng, et al.
Published: (2023)

Rationales Are Not Silver Bullets: Measuring the Impact of Rationales on Model Performance and Reliability
by: Zhu, Chiwei, et al.
Published: (2025)

ACE-Router: Generalizing History-Aware Routing from MCP Tools to the Agent Web
by: Yao, Zhiyuan, et al.
Published: (2026)

MCP-Zero: Active Tool Discovery for Autonomous LLM Agents
by: Fei, Xiang, et al.
Published: (2025)

MCP-ITP: An Automated Framework for Implicit Tool Poisoning in MCP
by: Li, Ruiqi, et al.
Published: (2026)

MCP-Atlas: A Large-Scale Benchmark for Tool-Use Competency with Real MCP Servers
by: Bandi, Chaithanya, et al.
Published: (2026)

MCPToolBench++: A Large Scale AI Agent Model Context Protocol MCP Tool Use Benchmark
by: Fan, Shiqing, et al.
Published: (2025)

ComplexMCP: Evaluation of LLM Agents in Dynamic, Interdependent, and Large-Scale Tool Sandbox
by: Li, Yuanyang, et al.
Published: (2026)

MCP-Persona: Benchmarking LLM Agents on Real-World Personal Applications via Environment Simulation
by: Wang, Wenhao, et al.
Published: (2026)

PHMForge: Evaluating LLM Agents on Industrial Prognostics through MCP-Native, Algorithm-Grounded Tools
by: Feng, Tianjun, et al.
Published: (2026)

MCP Security Bench (MSB): Benchmarking Attacks Against Model Context Protocol in LLM Agents
by: Zhang, Dongsen, et al.
Published: (2025)

AgentDistill: Training-Free Agent Distillation with Generalizable MCP Boxes
by: Qiu, Jiahao, et al.
Published: (2025)

ParaView-MCP: An Autonomous Visualization Agent with Direct Tool Use
by: Liu, Shusen, et al.
Published: (2025)

AgentMaster: A Multi-Agent Conversational Framework Using A2A and MCP Protocols for Multimodal Information Retrieval and Analysis
by: Liao, Callie C., et al.
Published: (2025)

EE-MCP: Self-Evolving MCP-GUI Agents via Automated Environment Generation and Experience Learning
by: He, Tiantian, et al.
Published: (2026)

CR-Bench: Evaluating the Real-World Utility of AI Code Review Agents
by: Pereira, Kristen, et al.
Published: (2026)