Bibliographic Details
Main Authors: Ma, Liang; Wen, Jiajun; Lin, Min; Xu, Rongtao; Liang, Xiwen; Lin, Bingqian; Ma, Jun; Wang, Yongxin; Wei, Ziming; Lin, Haokun; Han, Mingfei; Cao, Meng; Chen, Bokui; Laptev, Ivan; Liang, Xiaodan
Format: Preprint
Published: 2025
Subjects:
Online Access: https://arxiv.org/abs/2506.08708
author Ma, Liang
Wen, Jiajun
Lin, Min
Xu, Rongtao
Liang, Xiwen
Lin, Bingqian
Ma, Jun
Wang, Yongxin
Wei, Ziming
Lin, Haokun
Han, Mingfei
Cao, Meng
Chen, Bokui
Laptev, Ivan
Liang, Xiaodan
contents While vision-language models (VLMs) have demonstrated promising capabilities in reasoning and planning for embodied agents, their ability to comprehend physical phenomena, particularly within structured 3D environments, remains severely limited. To close this gap, we introduce PhyBlock, a progressive benchmark designed to assess VLMs on physical understanding and planning through robotic 3D block assembly tasks. PhyBlock integrates a novel four-level cognitive hierarchy assembly task alongside targeted Visual Question Answering (VQA) samples, collectively aimed at evaluating progressive spatial reasoning and fundamental physical comprehension, including object properties, spatial relationships, and holistic scene understanding. PhyBlock includes 2600 block tasks (400 assembly tasks, 2200 VQA tasks) and evaluates models across three key dimensions: partial completion, failure diagnosis, and planning robustness. We benchmark 21 state-of-the-art VLMs, highlighting their strengths and limitations in physically grounded, multi-step planning. Our empirical findings indicate that VLMs exhibit pronounced limitations in high-level planning and reasoning, with performance declining notably as task complexity grows. Error analysis reveals persistent difficulties in spatial orientation and dependency reasoning. Surprisingly, chain-of-thought prompting offers minimal improvements, suggesting that spatial tasks rely heavily on intuitive model comprehension. We position PhyBlock as a unified testbed to advance embodied reasoning, bridging vision-language understanding and real-world physical problem-solving.
format Preprint
id arxiv_https___arxiv_org_abs_2506_08708
institution arXiv
publishDate 2025
record_format arxiv
title PhyBlock: A Progressive Benchmark for Physical Understanding and Planning via 3D Block Assembly
topic Robotics
Artificial Intelligence
Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2506.08708