_version_ 1866908961362411520
author CocoaBench Team
Hao, Shibo
Zhang, Zhining
Liang, Zhiqi
Liu, Tianyang
Zha, Yuheng
Gao, Qiyue
Chen, Jixuan
Wang, Zilong
Cheng, Zhoujun
Zhang, Haoxiang
Wang, Junli
Jin, Hexi
Zheng, Boyuan
Zhou, Kun
Wang, Yu
Yao, Feng
Liu, Licheng
Li, Yijiang
Li, Zhifei
Han, Zhengtao
Promthaw, Pracha
Cerruti, Tommaso
Fu, Xiaohan
Ma, Ziqiao
Shang, Jingbo
Qin, Lianhui
McAuley, Julian
Xing, Eric P.
Liu, Zhengzhong
Srivastava, Rupesh Kumar
Hu, Zhiting
contents LLM agents now perform strongly in software engineering, deep research, GUI automation, and various other applications, while recent agent scaffolds and models increasingly integrate these capabilities into unified systems. Yet most evaluations still test these capabilities in isolation, leaving a gap for more diverse use cases that require agents to combine different capabilities. We introduce CocoaBench, a benchmark for unified digital agents built from human-designed, long-horizon tasks that require flexible composition of vision, search, and coding. Tasks are specified only by an instruction and an automatic evaluation function over the final output, enabling reliable and scalable evaluation across diverse agent infrastructures. We also present CocoaAgent, a lightweight shared scaffold for controlled comparison across model backbones. Experiments show that current agents remain far from reliable on CocoaBench, with the best evaluated system achieving a success rate of only 45.1%. Our analysis further points to substantial room for improvement in reasoning and planning, tool use and execution, and visual grounding.
format Preprint
id arxiv_https___arxiv_org_abs_2604_11201
institution arXiv
publishDate 2026
record_format arxiv
title CocoaBench: Evaluating Unified Digital Agents in the Wild
topic Computation and Language
Artificial Intelligence
url https://arxiv.org/abs/2604.11201