| Main Authors: | Diao, Lingxiao, Xu, Xinyue, Sun, Wanxuan, Yang, Cheng, Zhang, Zhuosheng |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2505.11368 |
Similar Items
TOD-ProcBench: Benchmarking Complex Instruction-Following in Task-Oriented Dialogues
by: Ghazarian, Sarik, et al.
Published: (2025)
FlowBench: Revisiting and Benchmarking Workflow-Guided Planning for LLM-based Agents
by: Xiao, Ruixuan, et al.
Published: (2024)
Look Before You Leap: Enhancing Attention and Vigilance Regarding Harmful Content with GuidelineLLM
by: Zhang, Shaoqing, et al.
Published: (2024)
Atomic-to-Compositional Generalization for Mobile Agents with A New Benchmark and Scheduling System
by: Guo, Yuan, et al.
Published: (2025)
Smoothing Grounding and Reasoning for MLLM-Powered GUI Agents with Query-Oriented Pivot Tasks
by: Wu, Zongru, et al.
Published: (2025)
ChemSafetyBench: Benchmarking LLM Safety on Chemistry Domain
by: Zhao, Haochen, et al.
Published: (2024)
Mobile-Bench: An Evaluation Benchmark for LLM-based Mobile Agents
by: Deng, Shihan, et al.
Published: (2024)
LegalAgentBench: Evaluating LLM Agents in Legal Domain
by: Li, Haitao, et al.
Published: (2024)
You Only Look at Screens: Multimodal Chain-of-Action Agents
by: Zhang, Zhuosheng, et al.
Published: (2023)
R-Judge: Benchmarking Safety Risk Awareness for LLM Agents
by: Yuan, Tongxin, et al.
Published: (2024)
GroupTravelBench: Benchmarking LLM Agents on Multi-Person Travel Planning
by: Cheng, Xiang, et al.
Published: (2026)
CoCo-Agent: A Comprehensive Cognitive MLLM Agent for Smartphone GUI Automation
by: Ma, Xinbei, et al.
Published: (2024)
ColorBench: Benchmarking Mobile Agents with Graph-Structured Framework for Complex Long-Horizon Tasks
by: Song, Yuanyi, et al.
Published: (2025)
ShoppingBench: A Real-World Intent-Grounded Shopping Benchmark for LLM-based Agents
by: Wang, Jiangyuan, et al.
Published: (2025)
DataSciBench: An LLM Agent Benchmark for Data Science
by: Zhang, Dan, et al.
Published: (2025)
GroupMemBench: Benchmarking LLM Agent Memory in Multi-Party Conversations
by: Yang, Jingbo, et al.
Published: (2026)
Agent-SafetyBench: Evaluating the Safety of LLM Agents
by: Zhang, Zhexin, et al.
Published: (2024)
Caution for the Environment: Multimodal LLM Agents are Susceptible to Environmental Distractions
by: Ma, Xinbei, et al.
Published: (2024)
MemGuide: Intent-Driven Memory Selection for Goal-Oriented Multi-Session LLM Agents
by: Du, Yiming, et al.
Published: (2025)
VitaBench: Benchmarking LLM Agents with Versatile Interactive Tasks in Real-world Applications
by: He, Wei, et al.
Published: (2025)
FHIR-AgentBench: Benchmarking LLM Agents for Realistic Interoperable EHR Question Answering
by: Lee, Gyubok, et al.
Published: (2025)
DOCBENCH: A Benchmark for Evaluating LLM-based Document Reading Systems
by: Zou, Anni, et al.
Published: (2024)
IF-RewardBench: Benchmarking Judge Models for Instruction-Following Evaluation
by: Wen, Bosi, et al.
Published: (2026)
Do These LLM Benchmarks Agree? Fixing Benchmark Evaluation with BenchBench
by: Perlitz, Yotam, et al.
Published: (2024)
OctoBench: Benchmarking Scaffold-Aware Instruction Following in Repository-Grounded Agentic Coding
by: Ding, Deming, et al.
Published: (2026)
CryptoBench: A Dynamic Benchmark for Expert-Level Evaluation of LLM Agents in Cryptocurrency
by: Guo, Jiacheng, et al.
Published: (2025)
Flooding Spread of Manipulated Knowledge in LLM-Based Multi-Agent Communities
by: Ju, Tianjie, et al.
Published: (2024)
GEM: Gaussian Embedding Modeling for Out-of-Distribution Detection in GUI Agents
by: Wu, Zheng, et al.
Published: (2025)
PodBench: A Comprehensive Benchmark for Instruction-Aware Audio-Oriented Podcast Script Generation
by: Xu, Chenning, et al.
Published: (2026)
ReplicatorBench: Benchmarking LLM Agents for Replicability in Social and Behavioral Sciences
by: Nguyen, Bang, et al.
Published: (2026)
StreamBench: Towards Benchmarking Continuous Improvement of Language Agents
by: Wu, Cheng-Kuang, et al.
Published: (2024)
CO-Bench: Benchmarking Language Model Agents in Algorithm Search for Combinatorial Optimization
by: Sun, Weiwei, et al.
Published: (2025)
BenchGuard: Who Guards the Benchmarks? Automated Auditing of LLM Agent Benchmarks
by: Tu, Xinming, et al.
Published: (2026)
EduBench: A Comprehensive Benchmarking Dataset for Evaluating Large Language Models in Diverse Educational Scenarios
by: Xu, Bin, et al.
Published: (2025)
Self-Prompting Large Language Models for Zero-Shot Open-Domain QA
by: Li, Junlong, et al.
Published: (2022)
LCTG Bench: LLM Controlled Text Generation Benchmark
by: Kurihara, Kentaro, et al.
Published: (2025)
MCP-Bench: Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers
by: Wang, Zhenting, et al.
Published: (2025)
AD-Bench: A Real-World, Trajectory-Aware Advertising Analytics Benchmark for LLM Agents
by: Hu, Lingxiang, et al.
Published: (2026)
ISD-Agent-Bench: A Comprehensive Benchmark for Evaluating LLM-based Instructional Design Agents
by: Jeon, YoungHoon, et al.
Published: (2026)
FActBench: A Benchmark for Fine-grained Automatic Evaluation of LLM-Generated Text in the Medical Domain
by: Afzal, Anum, et al.
Published: (2025)