Bibliographic Details
Main Authors: Wu, Yuhao, Song, Maojia, Lan, Yihuai, Wang, Lei, Hu, Zhiqiang, Xiao, Yao, Zhou, Heng, Zheng, Weihua, Raharja, Dylan, Poria, Soujanya, Lee, Roy Ka-Wei
Format: Preprint
Published: 2026
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2602.21015
author Wu, Yuhao
Song, Maojia
Lan, Yihuai
Wang, Lei
Hu, Zhiqiang
Xiao, Yao
Zhou, Heng
Zheng, Weihua
Raharja, Dylan
Poria, Soujanya
Lee, Roy Ka-Wei
contents Understanding physical structure is essential for real-world applications such as embodied agents, interactive design, and long-horizon manipulation. Yet prevailing Vision-Language Model (VLM) evaluations still center on structure-agnostic, single-turn setups (e.g., VQA), which fail to assess agents' ability to reason about how geometry, contact, and support relations jointly constrain what actions are possible in a dynamic environment. To address this gap, we introduce the Causal Hierarchy of Actions and Interactions (CHAIN) benchmark, an interactive 3D, physics-driven testbed designed to evaluate whether models can understand, plan, and execute structured action sequences grounded in physical constraints. CHAIN shifts evaluation from passive perception to active problem solving, spanning tasks such as interlocking mechanical puzzles and 3D stacking and packing. We conduct a comprehensive study of state-of-the-art VLMs and diffusion-based models under unified interactive settings. Our results show that top-performing models still struggle to internalize physical structure and causal constraints: they often fail to produce reliable long-horizon plans and cannot robustly translate perceived structure into effective actions. The project is available at https://social-ai-studio.github.io/CHAIN/.
format Preprint
id arxiv_https___arxiv_org_abs_2602_21015
institution arXiv
publishDate 2026
record_format arxiv
title From Perception to Action: An Interactive Benchmark for Vision Reasoning
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2602.21015