Table of Contents: :: Library Catalog

Saved in:

Bibliographic Details
Main Authors:	Zhang, Yi, Xia, Youya, Wang, Yong, Song, Meng, Wu, Xin, Wan, Wenjun, Liu, Bingbing, Ye, AiXue, Zhang, Hongbo, Wen, Feng
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2603.00409
Tags:	Add Tag No Tags, Be the first to tag this record!

Table of Contents:

While Multimodal Large Language Models (MLLMs) excel in semantic tasks, they frequently lack the "spatial sense" essential for sophisticated geometric reasoning. Current models typically suffer from exorbitant modality-alignment costs and deficiency in fine-grained structural modeling precision.We introduce SSR, a framework designed for Structured Scene Reasoning that seamlessly integrates 2D and 3D representations via a lightweight alignment mechanism. To minimize training overhead, our framework anchors 3D geometric features to the large language model's pre-aligned 2D visual semantics through cross-modal addition and token interleaving, effectively obviating the necessity for large-scale alignment pre-training. To underpin complex spatial reasoning, we propose a novel scene graph generation pipeline that represents global layouts as a chain of independent local triplets defined by relative coordinates. This is complemented by an incremental generation algorithm, enabling the model to construct "language-model-friendly" structural scaffolds for complex environments. Furthermore, we extend these capabilities to global-scale 3D global grounding task, achieving absolute metric precision across heterogeneous data sources. At a 7B parameter scale, SSR achieves state-of-the-art performance on multiple spatial intelligence benchmarks, notably scoring 73.9 on VSI-Bench. Our approach significantly outperforms much larger models, demonstrating that efficient feature alignment and structured scene reasoning are the cornerstones of authentic spatial intelligence.

Similar Items