Staff View :: Library Catalog

Bibliographic Details
Main Authors:	Gong, Xuan; Huang, Hanbo; Zheng, Hao; Zhang, Yiran; Dai, Wenbin; Zhao, Weishu; Liang, Shiyu
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2605.09614

_version_	1866911669619261440
author	Gong, Xuan; Huang, Hanbo; Zheng, Hao; Zhang, Yiran; Dai, Wenbin; Zhao, Weishu; Liang, Shiyu
author_facet	Gong, Xuan; Huang, Hanbo; Zheng, Hao; Zhang, Yiran; Dai, Wenbin; Zhao, Weishu; Liang, Shiyu
contents	Long chain-of-thought (CoT) reasoning improves large vision-language models, but visual information often fades during generation, limiting long-horizon multimodal reasoning. Existing methods either re-inject vision at inference or train policies for stronger grounding, but where to intervene relies on perception heuristics rather than principled gain analysis, and how local visual influence propagates remains implicit. We study this problem from an information-theoretic standpoint and derive a lower bound on the downstream visual gain of a one-step intervention, which suggests two factors: local branching room (token entropy) and downstream visual propagation potential (suffix divergence from a vision-marginalized reference). Guided by this analysis, we propose reflection-anchor policy optimization (RAPO), a GRPO-based policy optimization method that selects high-entropy reflection anchors and optimizes a chain-masked finite-window KL surrogate for downstream visual dependence. Experiments on reasoning-intensive and general-domain benchmarks show that RAPO delivers substantial gains over strong baselines across multiple LVLM backbones. Mechanism analyses further indicate that reflection anchors are enriched for visually sensitive decision points and that RAPO increases contrastive visual-dependence signals along generated trajectories.
format	Preprint
id	arxiv_https___arxiv_org_abs_2605_09614
institution	arXiv
publishDate	2026
record_format	arxiv
title	Reflection Anchors for Propagation-Aware Visual Retention in Long-Chain Multimodal Reasoning
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2605.09614
