| Main Authors: | Guo, Zikang, Xu, Benfeng, Zhu, Chiwei, Hong, Wentao, Wang, Xiaorui, Mao, Zhendong |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Online Access: | https://arxiv.org/abs/2509.09734 |
Similar Items
Automated Creativity Evaluation for Large Language Models: A Reference-Based Approach
by: Li, Ruizhe, et al.
Published: (2025)
DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents
by: Du, Mingxuan, et al.
Published: (2025)
MIRROR: Multi-agent Intra- and Inter-Reflection for Optimized Reasoning in Tool Learning
by: Guo, Zikang, et al.
Published: (2025)
AgentBench: Evaluating LLMs as Agents
by: Liu, Xiao, et al.
Published: (2023)
From Real to Synthetic: Synthesizing Millions of Diversified and Complicated User Instructions with Attributed Grounding
by: Zhu, Chiwei, et al.
Published: (2025)
DeepResearch Bench II: Diagnosing Deep Research Agents via Rubrics from Expert Report
by: Li, Ruizhe, et al.
Published: (2026)
MCP-SafetyBench: A Benchmark for Safety Evaluation of Large Language Models with Real-World MCP Servers
by: Zong, Xuanjun, et al.
Published: (2025)
FinMCP-Bench: Benchmarking LLM Agents for Real-World Financial Tool Use under the Model Context Protocol
by: Zhu, Jie, et al.
Published: (2026)
MCP-Flow: Facilitating LLM Agents to Master Real-World, Diverse and Scaling MCP Tools
by: Wang, Wenhao, et al.
Published: (2025)
LiveMCPBench: Can Agents Navigate an Ocean of MCP Tools?
by: Mo, Guozhao, et al.
Published: (2025)
FHIR-AgentBench: Benchmarking LLM Agents for Realistic Interoperable EHR Question Answering
by: Lee, Gyubok, et al.
Published: (2025)
LiveMCP-101: Stress Testing and Diagnosing MCP-enabled Agents on Challenging Queries
by: Yin, Ming, et al.
Published: (2025)
FS-Researcher: Test-Time Scaling for Long-Horizon Research Tasks with File-System-Based Agents
by: Zhu, Chiwei, et al.
Published: (2026)
MCPAgentBench: A Real-world Task Benchmark for Evaluating LLM Agent MCP Tool Use
by: Liu, Wenrui, et al.
Published: (2025)
MCP-Bench: Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers
by: Wang, Zhenting, et al.
Published: (2025)
MCPEval: Automatic MCP-based Deep Evaluation for AI Agent Models
by: Liu, Zhiwei, et al.
Published: (2025)
Wiki Live Challenge: Challenging Deep Research Agents with Expert-Level Wikipedia Articles
by: Wang, Shaohan, et al.
Published: (2026)
TOUCAN: Synthesizing 1.5M Tool-Agentic Data from Real-World MCP Environments
by: Xu, Zhangchen, et al.
Published: (2025)
MCP-Universe: Benchmarking Large Language Models with Real-World Model Context Protocol Servers
by: Luo, Ziyang, et al.
Published: (2025)
A-RAG: Scaling Agentic Retrieval-Augmented Generation via Hierarchical Retrieval Interfaces
by: Du, Mingxuan, et al.
Published: (2026)
MCP-Cosmos: World Model-Augmented Agents for Complex Task Execution in MCP Environments
by: Ganapavarapu, Giridhar, et al.
Published: (2026)
Network and Systems Performance Characterization of MCP-Enabled LLM Agents
by: Ding, Zihao, et al.
Published: (2025)
WildGraphBench: Benchmarking GraphRAG with Wild-Source Corpora
by: Wang, Pengyu, et al.
Published: (2026)
Training LLM-Based Agents with Synthetic Self-Reflected Trajectories and Partial Masking
by: Chen, Yihan, et al.
Published: (2025)
ExpertPrompting: Instructing Large Language Models to be Distinguished Experts
by: Xu, Benfeng, et al.
Published: (2023)
Rationales Are Not Silver Bullets: Measuring the Impact of Rationales on Model Performance and Reliability
by: Zhu, Chiwei, et al.
Published: (2025)
ACE-Router: Generalizing History-Aware Routing from MCP Tools to the Agent Web
by: Yao, Zhiyuan, et al.
Published: (2026)
MCP-Zero: Active Tool Discovery for Autonomous LLM Agents
by: Fei, Xiang, et al.
Published: (2025)
MCP-ITP: An Automated Framework for Implicit Tool Poisoning in MCP
by: Li, Ruiqi, et al.
Published: (2026)
MCP-Atlas: A Large-Scale Benchmark for Tool-Use Competency with Real MCP Servers
by: Bandi, Chaithanya, et al.
Published: (2026)
MCPToolBench++: A Large Scale AI Agent Model Context Protocol MCP Tool Use Benchmark
by: Fan, Shiqing, et al.
Published: (2025)
ComplexMCP: Evaluation of LLM Agents in Dynamic, Interdependent, and Large-Scale Tool Sandbox
by: Li, Yuanyang, et al.
Published: (2026)
MCP-Persona: Benchmarking LLM Agents on Real-World Personal Applications via Environment Simulation
by: Wang, Wenhao, et al.
Published: (2026)
PHMForge: Evaluating LLM Agents on Industrial Prognostics through MCP-Native, Algorithm-Grounded Tools
by: Feng, Tianjun, et al.
Published: (2026)
MCP Security Bench (MSB): Benchmarking Attacks Against Model Context Protocol in LLM Agents
by: Zhang, Dongsen, et al.
Published: (2025)
AgentDistill: Training-Free Agent Distillation with Generalizable MCP Boxes
by: Qiu, Jiahao, et al.
Published: (2025)
ParaView-MCP: An Autonomous Visualization Agent with Direct Tool Use
by: Liu, Shusen, et al.
Published: (2025)
AgentMaster: A Multi-Agent Conversational Framework Using A2A and MCP Protocols for Multimodal Information Retrieval and Analysis
by: Liao, Callie C., et al.
Published: (2025)
EE-MCP: Self-Evolving MCP-GUI Agents via Automated Environment Generation and Experience Learning
by: He, Tiantian, et al.
Published: (2026)
CR-Bench: Evaluating the Real-World Utility of AI Code Review Agents
by: Pereira, Kristen, et al.
Published: (2026)