Staff View: CocoaBench: Evaluating Unified Digital Agents in the Wild :: Library Catalog

Bibliographic Details
Main Authors:	CocoaBench Team; Hao, Shibo; Zhang, Zhining; Liang, Zhiqi; Liu, Tianyang; Zha, Yuheng; Gao, Qiyue; Chen, Jixuan; Wang, Zilong; Cheng, Zhoujun; Zhang, Haoxiang; Wang, Junli; Jin, Hexi; Zheng, Boyuan; Zhou, Kun; Wang, Yu; Yao, Feng; Liu, Licheng; Li, Yijiang; Li, Zhifei; Han, Zhengtao; Promthaw, Pracha; Cerruti, Tommaso; Fu, Xiaohan; Ma, Ziqiao; Shang, Jingbo; Qin, Lianhui; McAuley, Julian; Xing, Eric P.; Liu, Zhengzhong; Srivastava, Rupesh Kumar; Hu, Zhiting
Format:	Preprint
Published:	2026
Subjects:	Computation and Language; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2604.11201

_version_	1866908961362411520
author	CocoaBench Team; Hao, Shibo; Zhang, Zhining; Liang, Zhiqi; Liu, Tianyang; Zha, Yuheng; Gao, Qiyue; Chen, Jixuan; Wang, Zilong; Cheng, Zhoujun; Zhang, Haoxiang; Wang, Junli; Jin, Hexi; Zheng, Boyuan; Zhou, Kun; Wang, Yu; Yao, Feng; Liu, Licheng; Li, Yijiang; Li, Zhifei; Han, Zhengtao; Promthaw, Pracha; Cerruti, Tommaso; Fu, Xiaohan; Ma, Ziqiao; Shang, Jingbo; Qin, Lianhui; McAuley, Julian; Xing, Eric P.; Liu, Zhengzhong; Srivastava, Rupesh Kumar; Hu, Zhiting
author_facet	CocoaBench Team; Hao, Shibo; Zhang, Zhining; Liang, Zhiqi; Liu, Tianyang; Zha, Yuheng; Gao, Qiyue; Chen, Jixuan; Wang, Zilong; Cheng, Zhoujun; Zhang, Haoxiang; Wang, Junli; Jin, Hexi; Zheng, Boyuan; Zhou, Kun; Wang, Yu; Yao, Feng; Liu, Licheng; Li, Yijiang; Li, Zhifei; Han, Zhengtao; Promthaw, Pracha; Cerruti, Tommaso; Fu, Xiaohan; Ma, Ziqiao; Shang, Jingbo; Qin, Lianhui; McAuley, Julian; Xing, Eric P.; Liu, Zhengzhong; Srivastava, Rupesh Kumar; Hu, Zhiting
contents	LLM agents now perform strongly in software engineering, deep research, GUI automation, and various other applications, while recent agent scaffolds and models are increasingly integrating these capabilities into unified systems. Yet, most evaluations still test these capabilities in isolation, which leaves a gap for more diverse use cases that require agents to combine different capabilities. We introduce CocoaBench, a benchmark for unified digital agents built from human-designed, long-horizon tasks that require flexible composition of vision, search, and coding. Tasks are specified only by an instruction and an automatic evaluation function over the final output, enabling reliable and scalable evaluation across diverse agent infrastructures. We also present CocoaAgent, a lightweight shared scaffold for controlled comparison across model backbones. Experiments show that current agents remain far from reliable on CocoaBench, with the best evaluated system achieving only 45.1% success rate. Our analysis further points to substantial room for improvement in reasoning and planning, tool use and execution, and visual grounding.
format	Preprint
id	arxiv_https___arxiv_org_abs_2604_11201
institution	arXiv
publishDate	2026
record_format	arxiv
title	CocoaBench: Evaluating Unified Digital Agents in the Wild
topic	Computation and Language; Artificial Intelligence
url	https://arxiv.org/abs/2604.11201

Similar Items