Bibliographic Details
Main Authors: Luo, Yulin; Fan, Chun-Kai; Dong, Menghang; Shi, Jiayu; Zhao, Mengdi; Zhang, Bo-Wen; Chi, Cheng; Liu, Jiaming; Dai, Gaole; Zhang, Rongyu; An, Ruichuan; Wu, Kun; Che, Zhengping; Xie, Shaoxuan; Yao, Guocai; Zhao, Zhongxia; Wang, Pengwei; Liu, Guang; Wang, Zhongyuan; Huang, Tiejun; Zhang, Shanghang
Format: Preprint
Published: 2025
Subjects: Robotics; Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2510.17801
author Luo, Yulin
Fan, Chun-Kai
Dong, Menghang
Shi, Jiayu
Zhao, Mengdi
Zhang, Bo-Wen
Chi, Cheng
Liu, Jiaming
Dai, Gaole
Zhang, Rongyu
An, Ruichuan
Wu, Kun
Che, Zhengping
Xie, Shaoxuan
Yao, Guocai
Zhao, Zhongxia
Wang, Pengwei
Liu, Guang
Wang, Zhongyuan
Huang, Tiejun
Zhang, Shanghang
contents Building robots that can perceive, reason, and act in dynamic, unstructured environments remains a core challenge. Recent embodied systems often adopt a dual-system paradigm, where System 2 handles high-level reasoning while System 1 executes low-level control. In this work, we refer to System 2 as the embodied brain, emphasizing its role as the cognitive core for reasoning and decision-making in manipulation tasks. Given this role, systematic evaluation of the embodied brain is essential. Yet existing benchmarks either emphasize execution success or, when they target high-level reasoning, suffer from incomplete dimensions and limited task realism, offering only a partial picture of cognitive capability. To bridge this gap, we introduce RoboBench, a benchmark that systematically evaluates multimodal large language models (MLLMs) as embodied brains. Motivated by the critical roles across the full manipulation pipeline, RoboBench defines five dimensions (instruction comprehension, perception reasoning, generalized planning, affordance prediction, and failure analysis) spanning 14 capabilities, 25 tasks, and 6092 QA pairs. To ensure realism, we curate datasets across diverse embodiments, attribute-rich objects, and multi-view scenes, drawing from large-scale real robotic data. For planning, RoboBench introduces an evaluation framework, MLLM-as-world-simulator, which evaluates embodied feasibility by simulating whether predicted plans can achieve critical object-state changes. Experiments on 14 MLLMs reveal fundamental limitations: difficulties with implicit instruction comprehension, spatiotemporal reasoning, cross-scenario planning, fine-grained affordance understanding, and execution failure diagnosis. RoboBench provides a comprehensive scaffold to quantify high-level cognition and to guide the development of next-generation embodied MLLMs. The project page is at https://robo-bench.github.io.
format Preprint
id arxiv_https___arxiv_org_abs_2510_17801
institution arXiv
publishDate 2025
record_format arxiv
title RoboBench: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models as Embodied Brain
topic Robotics
Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2510.17801