Internformat: :: Library Catalog

Gespeichert in:

Bibliographische Detailangaben
Hauptverfasser:	Mao, Junyuan, Li, Qiankun, Meng, Linghao, He, Zhicheng, Zhou, Xinliang, Wang, Kun, Liu, Yang, Jin, Yueming
Format:	Preprint
Veröffentlicht:	2026
Schlagworte:	Computer Vision and Pattern Recognition
Online-Zugang:	https://arxiv.org/abs/2603.08800
Tags:	Tag hinzufügen Keine Tags, Fügen Sie den ersten Tag hinzu!

_version_	1866908875032100864
author	Mao, Junyuan Li, Qiankun Meng, Linghao He, Zhicheng Zhou, Xinliang Wang, Kun Liu, Yang Jin, Yueming
author_facet	Mao, Junyuan Li, Qiankun Meng, Linghao He, Zhicheng Zhou, Xinliang Wang, Kun Liu, Yang Jin, Yueming
contents	Recent advances in multimodal large language models largely rely on CLIP-based visual encoders, which emphasize global semantic alignment but struggle with fine-grained visual understanding. In contrast, DINOv3 provides strong pixel-level perception yet lacks coarse-grained semantic abstraction, leading to limited multi-granularity reasoning. To address this gap, we propose Granulon, a novel DINOv3-based MLLM with adaptive granularity augmentation. Granulon introduces a text-conditioned granularity Controller that dynamically adjusts the visual abstraction level according to the semantic scope of the textual input, and an Adaptive Token Aggregation module that performs granularity-guided pooling and relation-aware clustering to produce compact, semantically rich visual tokens. This design enables unified "pixel-to-fine-to-coarse" reasoning within a single forward pass. Extensive and interpretable experiments demonstrate that Granulon improves accuracy by ~30% and reduces hallucination by ~20%, outperforming all visual encoders under identical settings.
format	Preprint
id	arxiv_https___arxiv_org_abs_2603_08800
institution	arXiv
publishDate	2026
record_format	arxiv
spellingShingle	Granulon: Awakening Pixel-Level Visual Encoders with Adaptive Multi-Granularity Semantics for MLLM Mao, Junyuan Li, Qiankun Meng, Linghao He, Zhicheng Zhou, Xinliang Wang, Kun Liu, Yang Jin, Yueming Computer Vision and Pattern Recognition Recent advances in multimodal large language models largely rely on CLIP-based visual encoders, which emphasize global semantic alignment but struggle with fine-grained visual understanding. In contrast, DINOv3 provides strong pixel-level perception yet lacks coarse-grained semantic abstraction, leading to limited multi-granularity reasoning. To address this gap, we propose Granulon, a novel DINOv3-based MLLM with adaptive granularity augmentation. Granulon introduces a text-conditioned granularity Controller that dynamically adjusts the visual abstraction level according to the semantic scope of the textual input, and an Adaptive Token Aggregation module that performs granularity-guided pooling and relation-aware clustering to produce compact, semantically rich visual tokens. This design enables unified "pixel-to-fine-to-coarse" reasoning within a single forward pass. Extensive and interpretable experiments demonstrate that Granulon improves accuracy by ~30% and reduces hallucination by ~20%, outperforming all visual encoders under identical settings.
title	Granulon: Awakening Pixel-Level Visual Encoders with Adaptive Multi-Granularity Semantics for MLLM
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2603.08800

Ähnliche Einträge