Staff View: Perception-Aware Multimodal Spatial Reasoning from Monocular Images

Bibliographic Details
Main Authors:	Cheng, Yanchun; Wang, Rundong; Yang, Xulei; Prakash, Alok; Rus, Daniela; Ang Jr, Marcelo H; Li, ShiJie
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2603.06985

_version_	1866915842124414976
author	Cheng, Yanchun; Wang, Rundong; Yang, Xulei; Prakash, Alok; Rus, Daniela; Ang Jr, Marcelo H; Li, ShiJie
author_facet	Cheng, Yanchun; Wang, Rundong; Yang, Xulei; Prakash, Alok; Rus, Daniela; Ang Jr, Marcelo H; Li, ShiJie
contents	Spatial reasoning from monocular images is essential for autonomous driving, yet current Vision-Language Models (VLMs) still struggle with fine-grained geometric perception, particularly under large scale variation and ambiguous object appearance. We propose a simple yet effective perception-aware multimodal reasoning framework that equips VLMs with explicit object-centric grounding ability. Instead of relying on textual bounding-box outputs, each referred object is represented using all Visual Reference Tokens (VRTs) within its spatial extent, enabling visual evidence and textual reasoning to be processed jointly in a unified token space. To further strengthen cross-modal interaction, we construct a Multimodal Chain-of-Thought (MM-CoT) dataset that injects aligned visual and textual reasoning signals. A deterministic ordering strategy is introduced to make supervision over inherently unordered VRT sets fully compatible with the VLM's autoregressive next-token prediction. With only standard supervised fine-tuning, our method achieves substantial improvements on the SURDS benchmark, outperforming previous approaches - including those using RL-based post-training - by a large margin across both single-object and multi-object tasks. These results demonstrate that accurate perception and multimodal reasoning are mutually reinforcing, and together form the key to robust spatial understanding in challenging monocular driving scenarios.
format	Preprint
id	arxiv_https___arxiv_org_abs_2603_06985
institution	arXiv
publishDate	2026
record_format	arxiv
title	Perception-Aware Multimodal Spatial Reasoning from Monocular Images
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2603.06985
