| Main Authors: | Wang, Danqing, Sivaraman, Akshay, Li, Lei |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2606.01815 |
Similar Items
LiveClawBench: Benchmarking LLM Agents on Complex, Real-World Assistant Tasks
by: Long, Xiang, et al.
Published: (2026)
Is Vibe Coding Safe? Benchmarking Vulnerability of Agent-Generated Code in Real-World Tasks
by: Zhao, Songwen, et al.
Published: (2025)
Agent-SafetyBench: Evaluating the Safety of LLM Agents
by: Zhang, Zhexin, et al.
Published: (2024)
OdysseyBench: Evaluating LLM Agents on Long-Horizon Complex Office Application Workflows
by: Wang, Weixuan, et al.
Published: (2025)
SimulBench: Evaluating Language Models with Creative Simulation Tasks
by: Jia, Qi, et al.
Published: (2024)
MCP-Bench: Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers
by: Wang, Zhenting, et al.
Published: (2025)
LegalAgentBench: Evaluating LLM Agents in Legal Domain
by: Li, Haitao, et al.
Published: (2024)
HEART-Bench: Do LLM Agents Exhibit Human-like Psychology?
by: Peng, Weihan, et al.
Published: (2026)
OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language Environment Simulation
by: Hu, Xiaomeng, et al.
Published: (2026)
ActuBench: A Multi-Agent LLM Pipeline for Generation and Evaluation of Actuarial Reasoning Tasks
by: Schmidt, Jan-Philipp
Published: (2026)
Scaling LLM Inference with Optimized Sample Compute Allocation
by: Zhang, Kexun, et al.
Published: (2024)
Mobile-Bench: An Evaluation Benchmark for LLM-based Mobile Agents
by: Deng, Shihan, et al.
Published: (2024)
Reliable LLM-based User Simulator for Task-Oriented Dialogue Systems
by: Sekulić, Ivan, et al.
Published: (2024)
PrefBench: Evaluating Zero-Shot LLM Agents in Hidden-Preference Personalized Pricing Negotiations
by: Lei, Yingjie
Published: (2026)
GameBench: Evaluating Strategic Reasoning Abilities of LLM Agents
by: Costarelli, Anthony, et al.
Published: (2024)
Beyond Cooperative Simulators: Generating Realistic User Personas for Robust Evaluation of LLM Agents
by: Chopra, Harshita, et al.
Published: (2026)
MultiAgentBench: Evaluating the Collaboration and Competition of LLM agents
by: Zhu, Kunlun, et al.
Published: (2025)
ReportBench: Evaluating Deep Research Agents via Academic Survey Tasks
by: Li, Minghao, et al.
Published: (2025)
ColorBench: Benchmarking Mobile Agents with Graph-Structured Framework for Complex Long-Horizon Tasks
by: Song, Yuanyi, et al.
Published: (2025)
CryptoBench: A Dynamic Benchmark for Expert-Level Evaluation of LLM Agents in Cryptocurrency
by: Guo, Jiacheng, et al.
Published: (2025)
HumanLLM: Towards Personalized Understanding and Simulation of Human Nature
by: Lei, Yuxuan, et al.
Published: (2026)
CRAB: A Benchmark for Evaluating Curation of Retrieval-Augmented LLMs in Biomedicine
by: Zhong, Hanmeng, et al.
Published: (2025)
AgentCE-Bench: Agent Configurable Evaluation with Scalable Horizons and Controllable Difficulty under Lightweight Environments
by: Yang, Wang, et al.
Published: (2026)
TypedThinker: Diversify Large Language Model Reasoning with Typed Thinking
by: Wang, Danqing, et al.
Published: (2024)
Workspace-Bench 1.0: Benchmarking AI Agents on Workspace Tasks with Large-Scale File Dependencies
by: Tang, Zirui, et al.
Published: (2026)
SG-Bench: Evaluating LLM Safety Generalization Across Diverse Tasks and Prompt Types
by: Mou, Yutao, et al.
Published: (2024)
Textualized Agent-Style Reasoning for Complex Tasks by Multiple Round LLM Generation
by: Liang, Chen, et al.
Published: (2024)
UserBench: An Interactive Gym Environment for User-Centric Agents
by: Qian, Cheng, et al.
Published: (2025)
Learning Personalized Alignment for Evaluating Open-ended Text Generation
by: Wang, Danqing, et al.
Published: (2023)
AgentBench: Evaluating LLMs as Agents
by: Liu, Xiao, et al.
Published: (2023)
VitaBench: Benchmarking LLM Agents with Versatile Interactive Tasks in Real-world Applications
by: He, Wei, et al.
Published: (2025)
HoWToBench: Holistic Evaluation for LLM's Capability in Human-level Writing using Tree of Writing
by: Feng, Andrew Zhuoer, et al.
Published: (2026)
The Behavior Gap: Evaluating Zero-shot LLM Agents in Complex Task-Oriented Dialogs
by: Baidya, Avinash, et al.
Published: (2025)
Task-Dependent Evaluation of LLM Output Homogenization: A Taxonomy-Guided Framework
by: Jain, Shomik, et al.
Published: (2025)
Trajectory2Task: Training Robust Tool-Calling Agents with Synthesized Yet Verifiable Data for Complex User Intents
by: Wang, Ziyi, et al.
Published: (2026)
Cooperative Strategic Planning Enhances Reasoning Capabilities in Large Language Models
by: Wang, Danqing, et al.
Published: (2024)
ClawBench: Can AI Agents Complete Everyday Online Tasks?
by: Zhang, Yuxuan, et al.
Published: (2026)
Simulating Classroom Education with LLM-Empowered Agents
by: Zhang, Zheyuan, et al.
Published: (2024)
REI-Bench: Can Embodied Agents Understand Vague Human Instructions in Task Planning?
by: Jiang, Chenxi, et al.
Published: (2025)
Enhancing Dialogue State Tracking Models through LLM-backed User-Agents Simulation
by: Niu, Cheng, et al.
Published: (2024)