Bibliographic Details
Main Authors: Yang, Chen; Lin, Guanxin; He, Youquan; Chen, Peiyao; Liu, Guanghe; Mo, Yufan; Xu, Zhouyuan; Wang, Linhao; Zhang, Guohui; Zhang, Zihang; Zeng, Shenxiang; Wang, Chen; Fan, Jiansheng
Format: Preprint
Published: 2026
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2602.07864
contents Spatial intelligence is crucial for vision-language models (VLMs), yet many scene-centric benchmarks evaluate unconstrained environments where a single image may admit multiple plausible 3D interpretations. We introduce SSI-Bench, a VQA benchmark for Structure-Centric Spatial Reasoning (SCSR) in constraint-governed spaces. Built from complex real-world 3D structures, it uses structural constraints from geometry, topology, and physical feasibility to make component relations more determinate from visual evidence. The benchmark contains 1,000 ranking questions spanning geometric and topological reasoning, where correct ordering requires resolving all candidate-wise 3D relations, imposing stronger demands on spatial understanding. It is created through a fully human-centered pipeline with over 400 researcher-hours of image curation, component annotation, and question design. Evaluating 31 VLMs reveals a large gap to humans: the best open-source model achieves 22.2% accuracy and the strongest closed-source model reaches 33.6%, while humans score 91.6%. Further results show that chain-of-thought reasoning brings only marginal gains, and error analysis reveals fundamental limitations in current models' spatial understanding within constraint-governed spaces. Project page: https://ssi-bench.github.io.
format Preprint
id arxiv_https___arxiv_org_abs_2602_07864
institution arXiv
publishDate 2026
record_format arxiv
title Thinking in Structures: Evaluating Spatial Intelligence in Constraint-Governed Spaces
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2602.07864