Bibliographic Details
Main Authors: Mao, Junyuan, Li, Qiankun, Meng, Linghao, He, Zhicheng, Zhou, Xinliang, Wang, Kun, Liu, Yang, Jin, Yueming
Format: Preprint
Published: 2026
Subjects:
Online Access: https://arxiv.org/abs/2603.08800
author Mao, Junyuan
Li, Qiankun
Meng, Linghao
He, Zhicheng
Zhou, Xinliang
Wang, Kun
Liu, Yang
Jin, Yueming
contents Recent advances in multimodal large language models largely rely on CLIP-based visual encoders, which emphasize global semantic alignment but struggle with fine-grained visual understanding. In contrast, DINOv3 provides strong pixel-level perception yet lacks coarse-grained semantic abstraction, leading to limited multi-granularity reasoning. To address this gap, we propose Granulon, a novel DINOv3-based MLLM with adaptive granularity augmentation. Granulon introduces a text-conditioned granularity Controller that dynamically adjusts the visual abstraction level according to the semantic scope of the textual input, and an Adaptive Token Aggregation module that performs granularity-guided pooling and relation-aware clustering to produce compact, semantically rich visual tokens. This design enables unified "pixel-to-fine-to-coarse" reasoning within a single forward pass. Extensive and interpretable experiments demonstrate that Granulon improves accuracy by ~30% and reduces hallucination by ~20%, outperforming all visual encoders under identical settings.
format Preprint
id arxiv_https___arxiv_org_abs_2603_08800
institution arXiv
publishDate 2026
record_format arxiv
title Granulon: Awakening Pixel-Level Visual Encoders with Adaptive Multi-Granularity Semantics for MLLM
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2603.08800