Bibliographic Details
Main Authors:	Yang, Jiacheng; Chen, Anqi; Dang, Yunkai; Fan, Qi; Wang, Cong; Li, Wenbin; Miao, Feng; Gao, Yang
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2602.23615

_version_	1866917320849358848
author	Yang, Jiacheng; Chen, Anqi; Dang, Yunkai; Fan, Qi; Wang, Cong; Li, Wenbin; Miao, Feng; Gao, Yang
author_facet	Yang, Jiacheng; Chen, Anqi; Dang, Yunkai; Fan, Qi; Wang, Cong; Li, Wenbin; Miao, Feng; Gao, Yang
contents	Current Large Multimodal Models (LMMs) struggle with high-resolution visual inputs during the reasoning process, as the number of image tokens increases quadratically with resolution, introducing substantial redundancy and irrelevant information. A common practice is to identify key image regions and refer to their high-resolution counterparts during reasoning, typically trained with external visual supervision. However, such visual supervision cues require costly grounding labels from human annotators. Meanwhile, it remains an open question how to enhance a model's grounding abilities to support reasoning without relying on additional annotations. In this paper, we propose High-resolution Annotation-free Reasoning Technique (HART), a closed-loop framework that enables LMMs to focus on and self-verify key regions of high-resolution visual inputs. HART incorporates a post-training paradigm in which we design Advantage Preference Group Relative Policy Optimization (AP-GRPO) to encourage accurate localization of key regions without external visual annotations. Notably, HART provides explainable reasoning pathways and enables efficient optimization of localization. Extensive experiments on MME-RealWorld-Lite, TreeBench, V* Bench, HR-Bench-4K/8K, and MMStar demonstrate that HART improves performance across a wide range of high-resolution visual tasks, consistently outperforming strong baselines.
format	Preprint
id	arxiv_https___arxiv_org_abs_2602_23615
institution	arXiv
publishDate	2026
record_format	arxiv
title	Annotation-Free Visual Reasoning for High-Resolution Large Multimodal Models via Reinforcement Learning
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2602.23615