Staff View: Retrieve, Integrate, and Synthesize: Spatial-Semantic Grounded Latent Visual Reasoning :: Library Catalog

Bibliographic Details
Main Authors:	Cui, Jin; Long, Xinyue; Zhang, Xunyong; Zhang, Yadong; Su, Chuanchang; Gan, Jingye; Zhao, Boran; Ren, Pengju
Format:	Preprint
Published:	2026
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2605.07106

_version_	1866918489149669376
author	Cui, Jin; Long, Xinyue; Zhang, Xunyong; Zhang, Yadong; Su, Chuanchang; Gan, Jingye; Zhao, Boran; Ren, Pengju
author_facet	Cui, Jin; Long, Xinyue; Zhang, Xunyong; Zhang, Yadong; Su, Chuanchang; Gan, Jingye; Zhao, Boran; Ren, Pengju
contents	Multimodal Large Language Models (MLLMs) have made remarkable progress on vision-language reasoning, yet most methods still compress visual evidence into discrete textual thoughts, creating an information bottleneck for fine-grained perception. Recent latent visual reasoning methods attempt to reason in continuous hidden states, but we find that they suffer from insufficient manifold compatibility: latent trajectories drift away from pretrained reasoning circuits, collapse into instance-agnostic patterns, and are often bypassed during answer generation. To address these issues, we propose RIS (Retrieve, Integrate, and Synthesize), a spatial-semantic grounded framework that develops latent reasoning as a compatible extension of pretrained MLLM computation. We first construct a step-wise grounded reasoning dataset with bounding boxes and region-specific semantic descriptions. Built on this supervision, RIS anchors latent tokens to both spatial and semantic evidence, enforces their causal role through a progressive attention bottleneck, and introduces short language transition tokens to bridge synthesized latent states back to vocabulary-aligned decoding. Experiments on V*, HRBench4K, HRBench8K, MMVP, and BLINK show consistent improvements over closed/open-source and latent reasoning baselines. Further analyses demonstrate that RIS learns diverse, interpretable, and progressively integrated latent trajectories, offering a practical path toward faithful internal visual reasoning in MLLMs.
format	Preprint
id	arxiv_https___arxiv_org_abs_2605_07106
institution	arXiv
publishDate	2026
record_format	arxiv
title	Retrieve, Integrate, and Synthesize: Spatial-Semantic Grounded Latent Visual Reasoning
topic	Computation and Language
url	https://arxiv.org/abs/2605.07106

Similar Items