| Main Authors: | Li, Ningyuan, Shen, Haiyang, Liu, Mugeng, Han, Yudong, Shi, Zhuofan, Xie, Sixiong, Ma, Yun |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2605.22219 |
Similar Items
M3-BENCH: Process-Aware Evaluation of LLM Agents' Social Behaviors in Mixed-Motive Games
by: Xie, Sixiong, et al.
Published: (2026)
DeepWeb-Bench: A Deep Research Benchmark Demanding Massive Cross-Source Evidence and Long-Horizon Derivation
by: Xie, Sixiong, et al.
Published: (2026)
MindLoom: Composing Thought Modes for Frontier-Level Reasoning Data Synthesis
by: Shen, Haiyang, et al.
Published: (2026)
Teaching AI Through Benchmark Construction: QuestBench as a Course-Based Practice for Accountable Knowledge Work
by: Shen, Haiyang, et al.
Published: (2026)
Rethinking Explainable Disease Prediction: Synergizing Accuracy and Reliability via Reflective Cognitive Architecture
by: Shao, Zijian, et al.
Published: (2025)
ShortcutsBench: A Large-Scale Real-world Benchmark for API-based Agents
by: Shen, Haiyang, et al.
Published: (2024)
DRAGON: Domain-specific Robust Automatic Data Generation for RAG Optimization
by: Shen, Haiyang, et al.
Published: (2025)
PSPA-Bench: A Personalized Benchmark for Smartphone GUI Agent
by: Nie, Hongyi, et al.
Published: (2026)
ViDR: Grounding Multimodal Deep Research Reports in Source Visual Evidence
by: Shi, Zhuofan, et al.
Published: (2026)
SheetDesigner: MLLM-Powered Spreadsheet Layout Generation with Rule-Based and Vision-Based Reflection
by: Chen, Qin, et al.
Published: (2025)
AgentSearchBench: A Benchmark for AI Agent Search in the Wild
by: Wu, Bin, et al.
Published: (2026)
CO-Bench: Benchmarking Language Model Agents in Algorithm Search for Combinatorial Optimization
by: Sun, Weiwei, et al.
Published: (2025)
FinAgentBench: A Benchmark Dataset for Agentic Retrieval in Financial Question Answering
by: Choi, Chanyeol, et al.
Published: (2025)
DCA-Bench: A Benchmark for Dataset Curation Agents
by: Huang, Benhao, et al.
Published: (2024)
ESL-Bench: An Event-Driven Synthetic Longitudinal Benchmark for Health Agents
by: Li, Chao, et al.
Published: (2026)
SPAR: Scholar Paper Retrieval with LLM-based Agents for Enhanced Academic Search
by: Shi, Xiaofeng, et al.
Published: (2025)
VisBrowse-Bench: Benchmarking Visual-Native Search for Multimodal Browsing Agents
by: Zhang, Zhengbo, et al.
Published: (2026)
PeopleSearchBench: A Multi-Dimensional Benchmark for Evaluating AI-Powered People Search Platforms
by: Wang, Wei, et al.
Published: (2026)
AlphaPROBE: Alpha Mining via Principled Retrieval and On-graph biased evolution
by: Guo, Taian, et al.
Published: (2026)
PyBench: Evaluating LLM Agent on various real-world coding tasks
by: Zhang, Yaolun, et al.
Published: (2024)
WebANNS: Fast and Efficient Approximate Nearest Neighbor Search in Web Browsers
by: Liu, Mugeng, et al.
Published: (2025)
CSR-Bench: A Benchmark for Evaluating the Cross-modal Safety and Reliability of MLLMs
by: Liu, Yuxuan, et al.
Published: (2026)
RTL-BenchMT: Dynamic Maintenance of RTL Generation Benchmark Through Agent-Assisted Analysis and Revision
by: Wang, Jing, et al.
Published: (2026)
AirQualityBench: A Realistic Evaluation Benchmark for Global Air Quality Forecasting
by: Xu, Xing, et al.
Published: (2026)
HWE-Bench: Benchmarking LLM Agents on Real-World Hardware Bug Repair Tasks
by: Cui, Fan, et al.
Published: (2026)
SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks
by: Li, Xiangyi, et al.
Published: (2026)
SPA-Bench: A Comprehensive Benchmark for SmartPhone Agent Evaluation
by: Chen, Jingxuan, et al.
Published: (2024)
MMR-Bench: A Comprehensive Benchmark for Multimodal LLM Routing
by: Ma, Haoxuan, et al.
Published: (2026)
AInsteinBench: Benchmarking Coding Agents on Scientific Repositories
by: Duston, Titouan, et al.
Published: (2025)
OpenClawBench: Benchmarking Process-side Anomalies in Real-world Agent Execution Trajectories
by: Liu, Yibing, et al.
Published: (2026)
LocalSearchBench: Benchmarking Agentic Search in Real-World Local Life Services
by: He, Hang, et al.
Published: (2025)
GeoAgentBench: A Dynamic Execution Benchmark for Tool-Augmented Agents in Spatial Analysis
by: Yu, Bo, et al.
Published: (2026)
AJ-Bench: Benchmarking Agent-as-a-Judge for Environment-Aware Evaluation
by: Shi, Wentao, et al.
Published: (2026)
AgencyBench: Benchmarking the Frontiers of Autonomous Agents in 1M-Token Real-World Contexts
by: Li, Keyu, et al.
Published: (2026)
Mobile-Bench: An Evaluation Benchmark for LLM-based Mobile Agents
by: Deng, Shihan, et al.
Published: (2024)
EgoBench: An Interactive Egocentric Multimodal Benchmark for Tool-Using Agents
by: Liu, Yunqi, et al.
Published: (2026)
DualResearch: Entropy-Gated Dual-Graph Retrieval for Answer Reconstruction
by: Shi, Jinxin, et al.
Published: (2025)
MCPAgentBench: A Real-world Task Benchmark for Evaluating LLM Agent MCP Tool Use
by: Liu, Wenrui, et al.
Published: (2025)
ProBench: Benchmarking GUI Agents with Accurate Process Information
by: Yang, Leyang, et al.
Published: (2025)
VibeSearchBench: Benchmarking Long-horizon Proactive Search in the Wild
by: Xiaohongshu Inc.
Published: (2026)