Staff View: AdaFV: Rethinking of Visual-Language alignment for VLM acceleration :: Library Catalog

Bibliographic Details
Main Authors:	Han, Jiayi; Du, Liang; Wu, Yiwen; Zhou, Xiangguo; Du, Hongwei; Zheng, Weibo
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2501.09532

_version_	1866913673695461376
author	Han, Jiayi; Du, Liang; Wu, Yiwen; Zhou, Xiangguo; Du, Hongwei; Zheng, Weibo
author_facet	Han, Jiayi; Du, Liang; Wu, Yiwen; Zhou, Xiangguo; Du, Hongwei; Zheng, Weibo
contents	The success of VLMs often relies on a dynamic high-resolution scheme that adaptively expands each input image into multiple crops so that image details are retained. However, such approaches produce a large number of redundant visual tokens, significantly reducing the efficiency of the VLMs. To improve efficiency without introducing extra training costs, many works reduce the visual tokens by filtering out uninformative tokens or aggregating their information. Some approaches reduce the visual tokens according to the self-attention of VLMs, which is biased and results in inaccurate responses. Token reduction approaches that rely solely on visual cues are text-agnostic and fail to focus on the areas most relevant to the question, especially when the queried objects are not salient in the image. In this work, we first conduct experiments to show that the original text embeddings are aligned with the visual tokens, without bias toward the tail visual tokens. We then propose a self-adaptive cross-modality attention mixture mechanism that dynamically leverages visual saliency and text-to-image similarity in the pre-LLM layers to select the informative visual tokens. Extensive experiments demonstrate that the proposed approach achieves state-of-the-art training-free VLM acceleration performance, especially when the reduction rate is sufficiently large.
format	Preprint
id	arxiv_https___arxiv_org_abs_2501_09532
institution	arXiv
publishDate	2025
record_format	arxiv
title	AdaFV: Rethinking of Visual-Language alignment for VLM acceleration
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2501.09532

Similar Items