Bibliographic Details
Main Authors: Yuan, Jiangye; Kumar, Gowri; Wang, Baoyuan
Format: Preprint
Published: 2026
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2603.08592
_version_ 1866918467278471168
author Yuan, Jiangye
Kumar, Gowri
Wang, Baoyuan
contents While Multimodal Large Language Models (MLLMs) have achieved remarkable success in 2D visual understanding, their ability to reason about 3D space remains limited. To address this gap, we introduce geometrically referenced 3D scene representations (GR3D). Given a set of input images, GR3D annotates objects in the images with unique IDs and encodes their 3D geometric attributes as textual references indexed by these IDs. This representation enables MLLMs to interpret 3D cues using their advanced language-based skills in mathematical reasoning, while concurrently analyzing 2D visual features in a tightly coupled way. We present a simple yet effective approach based on GR3D, which requires no additional training and is readily applicable to different MLLMs. Implemented in a zero-shot setting, our approach yields substantial improvements on challenging spatial reasoning benchmarks, boosting GPT-5 performance by 9% on VSI-Bench and 12% on MindCube. Qualitative studies further demonstrate that GR3D empowers MLLMs to perform complex spatial reasoning with highly sparse input views.
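The abstract describes encoding per-object 3D attributes as text indexed by object IDs. A minimal sketch of what such a textual reference might look like follows. The record does not specify the actual GR3D format, so the class `SceneObject`, its fields, the units, and the line layout are all hypothetical and serve only as an illustration.

```python
from dataclasses import dataclass
from math import dist


@dataclass
class SceneObject:
    """One annotated object. Every field name here is hypothetical."""
    obj_id: int                          # unique ID also drawn on the images
    label: str                           # object category
    center: tuple[float, float, float]   # assumed: metres in a shared world frame
    size: tuple[float, float, float]     # assumed: box extents in metres


def to_text_reference(objects: list[SceneObject]) -> str:
    """Serialize 3D attributes as text lines indexed by object ID."""
    lines = []
    for o in sorted(objects, key=lambda o: o.obj_id):
        cx, cy, cz = o.center
        sx, sy, sz = o.size
        lines.append(
            f"[{o.obj_id}] {o.label}: center=({cx:.2f}, {cy:.2f}, {cz:.2f}) m, "
            f"size=({sx:.2f}, {sy:.2f}, {sz:.2f}) m"
        )
    return "\n".join(lines)


def center_distance(a: SceneObject, b: SceneObject) -> float:
    """Euclidean distance between two object centres."""
    return dist(a.center, b.center)


# Illustrative values only; not taken from the paper.
scene = [
    SceneObject(2, "table", (3.0, 4.0, 0.4), (1.2, 0.8, 0.8)),
    SceneObject(1, "chair", (0.0, 0.0, 0.4), (0.5, 0.5, 0.9)),
]
reference = to_text_reference(scene)
print(reference)
```

The resulting text could be placed in a prompt next to the ID-annotated images, so that a question such as "how far is object 1 from object 2" reduces to arithmetic over the listed coordinates, which `center_distance` mirrors.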
format Preprint
id arxiv_https___arxiv_org_abs_2603_08592
institution arXiv
publishDate 2026
record_format arxiv
title Boosting MLLM Spatial Reasoning with Geometrically Referenced 3D Scene Representations
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2603.08592