Internformat: :: Library Catalog

Gespeichert in:

Bibliographische Detailangaben
Hauptverfasser:	Zhang, Xin, Tao, Qiqi, Du, Jiawei, Liu, Moyun, Zhou, Joey Tianyi
Format:	Preprint
Veröffentlicht:	2026
Schlagworte:	Machine Learning
Online-Zugang:	https://arxiv.org/abs/2605.02735
Tags:	Tag hinzufügen Keine Tags, Fügen Sie den ersten Tag hinzu!

_version_	1866913087194398720
author	Zhang, Xin Tao, Qiqi Du, Jiawei Liu, Moyun Zhou, Joey Tianyi
author_facet	Zhang, Xin Tao, Qiqi Du, Jiawei Liu, Moyun Zhou, Joey Tianyi
contents	Continuous latent-space reasoning offers a compact alternative to textual chain-of-thought for multimodal models, enabling high-dimensional visual evidence to be integrated without explicit reasoning tokens. However, we identify a previously overlooked optimization pathology in existing latent visual reasoning methods: although visual latents become semantically enriched during training, their contribution to final answer prediction is systematically suppressed. Within the shared parameter space, the autoregressive objective favors shortcut reliance on direct visual input, driving latent tokens toward transition-like states rather than informative reasoning content. We term this phenomenon Silenced Visual Latents. To address it, we disentangle the two conflicting objectives by directly optimizing the latent reasoning at inference time, keeping backbone parameters frozen. In Stage I, visual latents are warmed up via query-guided contrastive latent--visual alignment, improving semantic quality while preventing latent collapse. In Stage II, the latent reasoning is further optimized via a confidence-progression reward, which incentivizes predicted token distributions along the latent span to become progressively more concentrated, routing predictions through the latent reasoning rather than bypassing it. Experiments across eight benchmarks and four model backbones show that inference-time latent optimization, without any parameter updates, effectively unleashes the suppressed reasoning capacity of visual latents.
format	Preprint
id	arxiv_https___arxiv_org_abs_2605_02735
institution	arXiv
publishDate	2026
record_format	arxiv
spellingShingle	Visual Latents Know More Than They Say: Unsilencing Latent Reasoning in MLLMs Zhang, Xin Tao, Qiqi Du, Jiawei Liu, Moyun Zhou, Joey Tianyi Machine Learning Continuous latent-space reasoning offers a compact alternative to textual chain-of-thought for multimodal models, enabling high-dimensional visual evidence to be integrated without explicit reasoning tokens. However, we identify a previously overlooked optimization pathology in existing latent visual reasoning methods: although visual latents become semantically enriched during training, their contribution to final answer prediction is systematically suppressed. Within the shared parameter space, the autoregressive objective favors shortcut reliance on direct visual input, driving latent tokens toward transition-like states rather than informative reasoning content. We term this phenomenon Silenced Visual Latents. To address it, we disentangle the two conflicting objectives by directly optimizing the latent reasoning at inference time, keeping backbone parameters frozen. In Stage I, visual latents are warmed up via query-guided contrastive latent--visual alignment, improving semantic quality while preventing latent collapse. In Stage II, the latent reasoning is further optimized via a confidence-progression reward, which incentivizes predicted token distributions along the latent span to become progressively more concentrated, routing predictions through the latent reasoning rather than bypassing it. Experiments across eight benchmarks and four model backbones show that inference-time latent optimization, without any parameter updates, effectively unleashes the suppressed reasoning capacity of visual latents.
title	Visual Latents Know More Than They Say: Unsilencing Latent Reasoning in MLLMs
topic	Machine Learning
url	https://arxiv.org/abs/2605.02735

Ähnliche Einträge