:: Library Catalog

Bibliographic Details
Main Authors:	Wang, Danqing; Sivaraman, Akshay; Li, Lei
Format:	Preprint
Published:	2026
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2606.01815

Similar Items

LiveClawBench: Benchmarking LLM Agents on Complex, Real-World Assistant Tasks
by: Long, Xiang, et al.
Published: (2026)

Is Vibe Coding Safe? Benchmarking Vulnerability of Agent-Generated Code in Real-World Tasks
by: Zhao, Songwen, et al.
Published: (2025)

Agent-SafetyBench: Evaluating the Safety of LLM Agents
by: Zhang, Zhexin, et al.
Published: (2024)

OdysseyBench: Evaluating LLM Agents on Long-Horizon Complex Office Application Workflows
by: Wang, Weixuan, et al.
Published: (2025)

SimulBench: Evaluating Language Models with Creative Simulation Tasks
by: Jia, Qi, et al.
Published: (2024)

MCP-Bench: Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers
by: Wang, Zhenting, et al.
Published: (2025)

LegalAgentBench: Evaluating LLM Agents in Legal Domain
by: Li, Haitao, et al.
Published: (2024)

HEART-Bench: Do LLM Agents Exhibit Human-like Psychology?
by: Peng, Weihan, et al.
Published: (2026)

OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language Environment Simulation
by: Hu, Xiaomeng, et al.
Published: (2026)

ActuBench: A Multi-Agent LLM Pipeline for Generation and Evaluation of Actuarial Reasoning Tasks
by: Schmidt, Jan-Philipp
Published: (2026)

Scaling LLM Inference with Optimized Sample Compute Allocation
by: Zhang, Kexun, et al.
Published: (2024)

Mobile-Bench: An Evaluation Benchmark for LLM-based Mobile Agents
by: Deng, Shihan, et al.
Published: (2024)

Reliable LLM-based User Simulator for Task-Oriented Dialogue Systems
by: Sekulić, Ivan, et al.
Published: (2024)

PrefBench: Evaluating Zero-Shot LLM Agents in Hidden-Preference Personalized Pricing Negotiations
by: Lei, Yingjie
Published: (2026)

GameBench: Evaluating Strategic Reasoning Abilities of LLM Agents
by: Costarelli, Anthony, et al.
Published: (2024)

Beyond Cooperative Simulators: Generating Realistic User Personas for Robust Evaluation of LLM Agents
by: Chopra, Harshita, et al.
Published: (2026)

MultiAgentBench: Evaluating the Collaboration and Competition of LLM agents
by: Zhu, Kunlun, et al.
Published: (2025)

ReportBench: Evaluating Deep Research Agents via Academic Survey Tasks
by: Li, Minghao, et al.
Published: (2025)

ColorBench: Benchmarking Mobile Agents with Graph-Structured Framework for Complex Long-Horizon Tasks
by: Song, Yuanyi, et al.
Published: (2025)

CryptoBench: A Dynamic Benchmark for Expert-Level Evaluation of LLM Agents in Cryptocurrency
by: Guo, Jiacheng, et al.
Published: (2025)

HumanLLM: Towards Personalized Understanding and Simulation of Human Nature
by: Lei, Yuxuan, et al.
Published: (2026)

CRAB: A Benchmark for Evaluating Curation of Retrieval-Augmented LLMs in Biomedicine
by: Zhong, Hanmeng, et al.
Published: (2025)

AgentCE-Bench: Agent Configurable Evaluation with Scalable Horizons and Controllable Difficulty under Lightweight Environments
by: Yang, Wang, et al.
Published: (2026)

TypedThinker: Diversify Large Language Model Reasoning with Typed Thinking
by: Wang, Danqing, et al.
Published: (2024)

Workspace-Bench 1.0: Benchmarking AI Agents on Workspace Tasks with Large-Scale File Dependencies
by: Tang, Zirui, et al.
Published: (2026)

SG-Bench: Evaluating LLM Safety Generalization Across Diverse Tasks and Prompt Types
by: Mou, Yutao, et al.
Published: (2024)

Textualized Agent-Style Reasoning for Complex Tasks by Multiple Round LLM Generation
by: Liang, Chen, et al.
Published: (2024)

UserBench: An Interactive Gym Environment for User-Centric Agents
by: Qian, Cheng, et al.
Published: (2025)

Learning Personalized Alignment for Evaluating Open-ended Text Generation
by: Wang, Danqing, et al.
Published: (2023)

AgentBench: Evaluating LLMs as Agents
by: Liu, Xiao, et al.
Published: (2023)

VitaBench: Benchmarking LLM Agents with Versatile Interactive Tasks in Real-world Applications
by: He, Wei, et al.
Published: (2025)

HoWToBench: Holistic Evaluation for LLM's Capability in Human-level Writing using Tree of Writing
by: Feng, Andrew Zhuoer, et al.
Published: (2026)

The Behavior Gap: Evaluating Zero-shot LLM Agents in Complex Task-Oriented Dialogs
by: Baidya, Avinash, et al.
Published: (2025)

Task-Dependent Evaluation of LLM Output Homogenization: A Taxonomy-Guided Framework
by: Jain, Shomik, et al.
Published: (2025)

Trajectory2Task: Training Robust Tool-Calling Agents with Synthesized Yet Verifiable Data for Complex User Intents
by: Wang, Ziyi, et al.
Published: (2026)

Cooperative Strategic Planning Enhances Reasoning Capabilities in Large Language Models
by: Wang, Danqing, et al.
Published: (2024)

ClawBench: Can AI Agents Complete Everyday Online Tasks?
by: Zhang, Yuxuan, et al.
Published: (2026)

Simulating Classroom Education with LLM-Empowered Agents
by: Zhang, Zheyuan, et al.
Published: (2024)

REI-Bench: Can Embodied Agents Understand Vague Human Instructions in Task Planning?
by: Jiang, Chenxi, et al.
Published: (2025)

Enhancing Dialogue State Tracking Models through LLM-backed User-Agents Simulation
by: Niu, Cheng, et al.
Published: (2024)