| Main Authors: | Li, Minghao, Zeng, Ying, Cheng, Zhihao, Ma, Cong, Jia, Kai |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Online Access: | https://arxiv.org/abs/2508.15804 |
Similar Items
DeepSurvey-Bench: Evaluating Academic Value of Automatically Generated Scientific Survey
by: Zhang, Guo-Biao, et al.
Published: (2026)
Dr. Bench: A Multidimensional Evaluation for Deep Research Agents, from Answers to Reports
by: Yao, Yang, et al.
Published: (2025)
AgentCE-Bench: Agent Configurable Evaluation with Scalable Horizons and Controllable Difficulty under Lightweight Environments
by: Yang, Wang, et al.
Published: (2026)
AgentCPM-Report: Interleaving Drafting and Deepening for Open-Ended Deep Research
by: Li, Yishan, et al.
Published: (2026)
GameBench: Evaluating Strategic Reasoning Abilities of LLM Agents
by: Costarelli, Anthony, et al.
Published: (2024)
AgentBench: Evaluating LLMs as Agents
by: Liu, Xiao, et al.
Published: (2023)
O-Researcher: An Open Ended Deep Research Model via Multi-Agent Distillation and Agentic RL
by: Yao, Yi, et al.
Published: (2026)
CocoaBench: Evaluating Unified Digital Agents in the Wild
by: CocoaBench Team, et al.
Published: (2026)
InfiAgent-DABench: Evaluating Agents on Data Analysis Tasks
by: Hu, Xueyu, et al.
Published: (2024)
ClawBench: Can AI Agents Complete Everyday Online Tasks?
by: Zhang, Yuxuan, et al.
Published: (2026)
ActuBench: A Multi-Agent LLM Pipeline for Generation and Evaluation of Actuarial Reasoning Tasks
by: Schmidt, Jan-Philipp
Published: (2026)
Mem2ActBench: A Benchmark for Evaluating Long-Term Memory Utilization in Task-Oriented Autonomous Agents
by: Shen, Yiting, et al.
Published: (2026)
Marco DeepResearch: Unlocking Efficient Deep Research Agents via Verification-Centric Design
by: Zhu, Bin, et al.
Published: (2026)
SPADE-Bench: Evaluating Spontaneous Strategic Deception in Agents via Plan-Action Divergence
by: Bu, Yuyan, et al.
Published: (2026)
ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code
by: Tang, Xiangru, et al.
Published: (2023)
MTalk-Bench: Evaluating Speech-to-Speech Models in Multi-Turn Dialogues via Arena-style and Rubrics Protocols
by: Du, Yuhao, et al.
Published: (2025)
StratMem-Bench: Evaluating Strategic Memory Use in Virtual Character Conversation Beyond Factual Recall
by: Wu, Yerong, et al.
Published: (2026)
CreativityBench: Evaluating Agent Creative Reasoning via Affordance-Based Tool Repurposing
by: Qian, Cheng, et al.
Published: (2026)
ReportLogic: Evaluating Logical Quality in Deep Research Reports
by: Zhao, Jujia, et al.
Published: (2026)
MemBench: Towards More Comprehensive Evaluation on the Memory of LLM-based Agents
by: Tan, Haoran, et al.
Published: (2025)
DDO: Dual-Decision Optimization for LLM-Based Medical Consultation via Multi-Agent Collaboration
by: Jia, Zhihao, et al.
Published: (2025)
PsychCounsel-Bench: Evaluating the Psychology Intelligence of Large Language Models
by: Zeng, Min
Published: (2025)
ClinBench-HPB: A Clinical Benchmark for Evaluating LLMs in Hepato-Pancreato-Biliary Diseases
by: Li, Yuchong, et al.
Published: (2025)
DeepScholar-Bench: A Live Benchmark and Automated Evaluation for Generative Research Synthesis
by: Patel, Liana, et al.
Published: (2025)
Magis-Bench: Evaluating LLMs on Magistrate-Level Legal Tasks
by: Pires, Ramon, et al.
Published: (2026)
Mobile-Bench: An Evaluation Benchmark for LLM-based Mobile Agents
by: Deng, Shihan, et al.
Published: (2024)
MLR-Bench: Evaluating AI Agents on Open-Ended Machine Learning Research
by: Chen, Hui, et al.
Published: (2025)
AgentBoard: An Analytical Evaluation Board of Multi-turn LLM Agents
by: Ma, Chang, et al.
Published: (2024)
InfoMosaic-Bench: Evaluating Multi-Source Information Seeking in Tool-Augmented Agents
by: Du, Yaxin, et al.
Published: (2025)
SurveyEval: Towards Comprehensive Evaluation of LLM-Generated Academic Surveys
by: Zhao, Jiahao, et al.
Published: (2025)
GeoBenchX: Benchmarking LLMs in Agent Solving Multistep Geospatial Tasks
by: Krechetova, Varvara, et al.
Published: (2025)
EduResearchBench: A Hierarchical Atomic Task Decomposition Benchmark for Full-Lifecycle Educational Research
by: Yue, Houping, et al.
Published: (2026)
TaskBench: Benchmarking Large Language Models for Task Automation
by: Shen, Yongliang, et al.
Published: (2023)
EVADE-Bench: Multimodal Benchmark for Evaluating and Enhancing Evasive Content Detection
by: Xu, Ancheng, et al.
Published: (2025)
AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories
by: Lù, Xing Han, et al.
Published: (2025)
EAPO: Enhancing Policy Optimization with On-Demand Expert Assistance
by: Song, Siyao, et al.
Published: (2025)
DeepPlanner: Scaling Planning Capability for Deep Research Agents via Advantage Shaping
by: Fan, Wei, et al.
Published: (2025)
VitaBench: Benchmarking LLM Agents with Versatile Interactive Tasks in Real-world Applications
by: He, Wei, et al.
Published: (2025)
ResearchBench: Benchmarking LLMs in Scientific Discovery via Inspiration-Based Task Decomposition
by: Liu, Yujie, et al.
Published: (2025)
Towards Personalized Deep Research: Benchmarks and Evaluations
by: Liang, Yuan, et al.
Published: (2025)