Staff View: Boosting MLLM Spatial Reasoning with Geometrically Referenced 3D Scene Representations :: Library Catalog

Bibliographic Details
Main Authors:	Yuan, Jiangye; Kumar, Gowri; Wang, Baoyuan
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2603.08592

_version_	1866918467278471168
author	Yuan, Jiangye; Kumar, Gowri; Wang, Baoyuan
author_facet	Yuan, Jiangye; Kumar, Gowri; Wang, Baoyuan
contents	While Multimodal Large Language Models (MLLMs) have achieved remarkable success in 2D visual understanding, their ability to reason about 3D space remains limited. To address this gap, we introduce geometrically referenced 3D scene representations (GR3D). Given a set of input images, GR3D annotates objects in the images with unique IDs and encodes their 3D geometric attributes as textual references indexed by these IDs. This representation enables MLLMs to interpret 3D cues using their advanced language-based skills in mathematical reasoning, while concurrently analyzing 2D visual features in a tightly coupled way. We present a simple yet effective approach based on GR3D, which requires no additional training and is readily applicable to different MLLMs. Implemented in a zero-shot setting, our approach yields substantial improvements on challenging spatial reasoning benchmarks, boosting GPT-5 performance by 9% on VSI-Bench and 12% on MindCube. Qualitative studies further demonstrate that GR3D empowers MLLMs to perform complex spatial reasoning with highly sparse input views.
format	Preprint
id	arxiv_https___arxiv_org_abs_2603_08592
institution	arXiv
publishDate	2026
record_format	arxiv
title	Boosting MLLM Spatial Reasoning with Geometrically Referenced 3D Scene Representations
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2603.08592
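
Illustrative note on the abstract: the contents field describes annotating objects with unique IDs and encoding their 3D geometric attributes as textual references indexed by those IDs. The sketch below shows one way such ID-indexed text references might look. It is not the authors' implementation: the `SceneObject` fields, the text layout, the units, and the sample objects are all hypothetical, chosen only to make the idea concrete.

```python
import math
from dataclasses import dataclass


@dataclass
class SceneObject:
    """Hypothetical record for one annotated object (not from the paper)."""
    obj_id: int                          # unique ID that would also be drawn on the image
    label: str                           # object category
    center: tuple[float, float, float]   # assumed 3D center in metres
    size: tuple[float, float, float]     # assumed bounding-box extents in metres


def to_text_reference(obj: SceneObject) -> str:
    """Render one object's geometry as a line of text keyed by its ID."""
    cx, cy, cz = obj.center
    sx, sy, sz = obj.size
    return (
        f"[{obj.obj_id}] {obj.label}: "
        f"center=({cx:.2f}, {cy:.2f}, {cz:.2f}) m, "
        f"size=({sx:.2f}, {sy:.2f}, {sz:.2f}) m"
    )


def build_references(objects: list[SceneObject]) -> str:
    """Join all per-object lines, ordered by ID, into one prompt-ready block."""
    ordered = sorted(objects, key=lambda o: o.obj_id)
    return "\n".join(to_text_reference(o) for o in ordered)


def center_distance(a: SceneObject, b: SceneObject) -> float:
    """Euclidean distance between two object centers."""
    return math.dist(a.center, b.center)


# Made-up sample scene, for illustration only.
objects = [
    SceneObject(2, "table", (3.0, 4.0, 0.0), (1.2, 0.8, 0.75)),
    SceneObject(1, "chair", (0.0, 0.0, 0.0), (0.5, 0.5, 1.0)),
]
references = build_references(objects)
```

With text of this kind placed next to ID-annotated images, a question such as "how far is object 1 from object 2?" becomes arithmetic over the listed coordinates, which is the kind of language-based mathematical reasoning the abstract refers to.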

Similar Items