Bibliographic Details
Main Authors: Lee, Donghee; Cai, Rui; Zhao, Zhe
Format: Preprint
Published: 2026
Subjects: Computer Vision and Pattern Recognition; Artificial Intelligence
Online Access: https://arxiv.org/abs/2601.13622
author Lee, Donghee
Cai, Rui
Zhao, Zhe
contents Large vision-language models (LVLMs) are typically trained using autoregressive language modeling objectives, which align visual representations with linguistic space. While effective for multimodal reasoning, this alignment can weaken vision-centric capabilities, causing LVLMs to underperform their base vision encoders on tasks such as image classification. To address this limitation, we propose Context-Aware Image Representation Prioritization via Ensemble (CARPE), a lightweight framework that integrates raw vision features with aligned LLM representations through vision-integration layers and a context-aware ensemble mechanism. This design enhances the model's ability to adaptively weight visual and textual modalities and enables the model to capture various aspects of image representations. Extensive experiments demonstrate that CARPE improves performance on both image classification and diverse vision-language benchmarks. Our results suggest that modality balancing plays a critical role in multimodal generalization by improving representation utilization within autoregressive LVLMs.
format Preprint
id arxiv_https___arxiv_org_abs_2601_13622
institution arXiv
publishDate 2026
record_format arxiv
title CARPE: Context-Aware Image Representation Prioritization via Ensemble for Large Vision-Language Models
topic Computer Vision and Pattern Recognition
Artificial Intelligence
url https://arxiv.org/abs/2601.13622