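| Main authors: | Yang, Chen; Lin, Guanxin; He, Youquan; Chen, Peiyao; Liu, Guanghe; Mo, Yufan; Xu, Zhouyuan; Wang, Linhao; Zhang, Guohui; Zhang, Zihang; Zeng, Shenxiang; Wang, Chen; Fan, Jiansheng |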
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online access: | https://arxiv.org/abs/2602.07864 |
| _version_ | 1866910271107235840 |
|---|---|
| author | Yang, Chen; Lin, Guanxin; He, Youquan; Chen, Peiyao; Liu, Guanghe; Mo, Yufan; Xu, Zhouyuan; Wang, Linhao; Zhang, Guohui; Zhang, Zihang; Zeng, Shenxiang; Wang, Chen; Fan, Jiansheng |
| author_facet | Yang, Chen; Lin, Guanxin; He, Youquan; Chen, Peiyao; Liu, Guanghe; Mo, Yufan; Xu, Zhouyuan; Wang, Linhao; Zhang, Guohui; Zhang, Zihang; Zeng, Shenxiang; Wang, Chen; Fan, Jiansheng |
| contents | Spatial intelligence is crucial for vision-language models (VLMs), yet many scene-centric benchmarks evaluate unconstrained environments where a single image may admit multiple plausible 3D interpretations. We introduce SSI-Bench, a VQA benchmark for Structure-Centric Spatial Reasoning (SCSR) in constraint-governed spaces. Built from complex real-world 3D structures, it uses structural constraints from geometry, topology, and physical feasibility to make component relations more determinate from visual evidence. The benchmark contains 1,000 ranking questions spanning geometric and topological reasoning, where correct ordering requires resolving all candidate-wise 3D relations, imposing stronger demands on spatial understanding. It is created through a fully human-centered pipeline with over 400 researcher-hours of image curation, component annotation, and question design. Evaluating 31 VLMs reveals a large gap to humans: the best open-source model achieves 22.2% accuracy and the strongest closed-source model reaches 33.6%, while humans score 91.6%. Further results show that chain-of-thought reasoning brings only marginal gains, and error analysis reveals fundamental limitations in current models' spatial understanding within constraint-governed spaces. Project page: https://ssi-bench.github.io. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2602_07864 |
| institution | arXiv |
| publishDate | 2026 |
| record_format | arxiv |
| title | Thinking in Structures: Evaluating Spatial Intelligence in Constraint-Governed Spaces |
| topic | Computer Vision and Pattern Recognition |
| url | https://arxiv.org/abs/2602.07864 |