Library Catalog

Bibliographic Details
Main Authors:	Diao, Lingxiao; Xu, Xinyue; Sun, Wanxuan; Yang, Cheng; Zhang, Zhuosheng
Format:	Preprint
Published:	2025
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2505.11368

Similar Items

TOD-ProcBench: Benchmarking Complex Instruction-Following in Task-Oriented Dialogues
by: Ghazarian, Sarik, et al.
Published: (2025)

FlowBench: Revisiting and Benchmarking Workflow-Guided Planning for LLM-based Agents
by: Xiao, Ruixuan, et al.
Published: (2024)

Look Before You Leap: Enhancing Attention and Vigilance Regarding Harmful Content with GuidelineLLM
by: Zhang, Shaoqing, et al.
Published: (2024)

Atomic-to-Compositional Generalization for Mobile Agents with A New Benchmark and Scheduling System
by: Guo, Yuan, et al.
Published: (2025)

Smoothing Grounding and Reasoning for MLLM-Powered GUI Agents with Query-Oriented Pivot Tasks
by: Wu, Zongru, et al.
Published: (2025)

ChemSafetyBench: Benchmarking LLM Safety on Chemistry Domain
by: Zhao, Haochen, et al.
Published: (2024)

Mobile-Bench: An Evaluation Benchmark for LLM-based Mobile Agents
by: Deng, Shihan, et al.
Published: (2024)

LegalAgentBench: Evaluating LLM Agents in Legal Domain
by: Li, Haitao, et al.
Published: (2024)

You Only Look at Screens: Multimodal Chain-of-Action Agents
by: Zhang, Zhuosheng, et al.
Published: (2023)

R-Judge: Benchmarking Safety Risk Awareness for LLM Agents
by: Yuan, Tongxin, et al.
Published: (2024)

GroupTravelBench: Benchmarking LLM Agents on Multi-Person Travel Planning
by: Cheng, Xiang, et al.
Published: (2026)

CoCo-Agent: A Comprehensive Cognitive MLLM Agent for Smartphone GUI Automation
by: Ma, Xinbei, et al.
Published: (2024)

ColorBench: Benchmarking Mobile Agents with Graph-Structured Framework for Complex Long-Horizon Tasks
by: Song, Yuanyi, et al.
Published: (2025)

ShoppingBench: A Real-World Intent-Grounded Shopping Benchmark for LLM-based Agents
by: Wang, Jiangyuan, et al.
Published: (2025)

DataSciBench: An LLM Agent Benchmark for Data Science
by: Zhang, Dan, et al.
Published: (2025)

GroupMemBench: Benchmarking LLM Agent Memory in Multi-Party Conversations
by: Yang, Jingbo, et al.
Published: (2026)

Agent-SafetyBench: Evaluating the Safety of LLM Agents
by: Zhang, Zhexin, et al.
Published: (2024)

Caution for the Environment: Multimodal LLM Agents are Susceptible to Environmental Distractions
by: Ma, Xinbei, et al.
Published: (2024)

MemGuide: Intent-Driven Memory Selection for Goal-Oriented Multi-Session LLM Agents
by: Du, Yiming, et al.
Published: (2025)

VitaBench: Benchmarking LLM Agents with Versatile Interactive Tasks in Real-world Applications
by: He, Wei, et al.
Published: (2025)

FHIR-AgentBench: Benchmarking LLM Agents for Realistic Interoperable EHR Question Answering
by: Lee, Gyubok, et al.
Published: (2025)

DOCBENCH: A Benchmark for Evaluating LLM-based Document Reading Systems
by: Zou, Anni, et al.
Published: (2024)

IF-RewardBench: Benchmarking Judge Models for Instruction-Following Evaluation
by: Wen, Bosi, et al.
Published: (2026)

Do These LLM Benchmarks Agree? Fixing Benchmark Evaluation with BenchBench
by: Perlitz, Yotam, et al.
Published: (2024)

OctoBench: Benchmarking Scaffold-Aware Instruction Following in Repository-Grounded Agentic Coding
by: Ding, Deming, et al.
Published: (2026)

CryptoBench: A Dynamic Benchmark for Expert-Level Evaluation of LLM Agents in Cryptocurrency
by: Guo, Jiacheng, et al.
Published: (2025)

Flooding Spread of Manipulated Knowledge in LLM-Based Multi-Agent Communities
by: Ju, Tianjie, et al.
Published: (2024)

GEM: Gaussian Embedding Modeling for Out-of-Distribution Detection in GUI Agents
by: Wu, Zheng, et al.
Published: (2025)

PodBench: A Comprehensive Benchmark for Instruction-Aware Audio-Oriented Podcast Script Generation
by: Xu, Chenning, et al.
Published: (2026)

ReplicatorBench: Benchmarking LLM Agents for Replicability in Social and Behavioral Sciences
by: Nguyen, Bang, et al.
Published: (2026)

StreamBench: Towards Benchmarking Continuous Improvement of Language Agents
by: Wu, Cheng-Kuang, et al.
Published: (2024)

CO-Bench: Benchmarking Language Model Agents in Algorithm Search for Combinatorial Optimization
by: Sun, Weiwei, et al.
Published: (2025)

BenchGuard: Who Guards the Benchmarks? Automated Auditing of LLM Agent Benchmarks
by: Tu, Xinming, et al.
Published: (2026)

EduBench: A Comprehensive Benchmarking Dataset for Evaluating Large Language Models in Diverse Educational Scenarios
by: Xu, Bin, et al.
Published: (2025)

Self-Prompting Large Language Models for Zero-Shot Open-Domain QA
by: Li, Junlong, et al.
Published: (2022)

LCTG Bench: LLM Controlled Text Generation Benchmark
by: Kurihara, Kentaro, et al.
Published: (2025)

MCP-Bench: Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers
by: Wang, Zhenting, et al.
Published: (2025)

AD-Bench: A Real-World, Trajectory-Aware Advertising Analytics Benchmark for LLM Agents
by: Hu, Lingxiang, et al.
Published: (2026)

ISD-Agent-Bench: A Comprehensive Benchmark for Evaluating LLM-based Instructional Design Agents
by: Jeon, YoungHoon, et al.
Published: (2026)

FActBench: A Benchmark for Fine-grained Automatic Evaluation of LLM-Generated Text in the Medical Domain
by: Afzal, Anum, et al.
Published: (2025)