Bibliographic Details
Main Authors: Zhang, Junyang; Yuan, Mu; Zhong, Ruiguang; Luo, Puhan; Zhan, Huiyou; Zhang, Ningkang; Hu, Chengchen; Li, Xiangyang
Format: Preprint
Published: 2024
Subjects: Artificial Intelligence; Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2409.14846
_version_ 1866909481647996928
author Zhang, Junyang
Yuan, Mu
Zhong, Ruiguang
Luo, Puhan
Zhan, Huiyou
Zhang, Ningkang
Hu, Chengchen
Li, Xiangyang
contents Large Vision-Language Models (LVLMs) integrate computer vision and natural language processing techniques and offer substantial application potential. However, these models demand extensive resources during inference. Adaptive attention techniques can dynamically reduce computational redundancy and thus improve efficiency. Although current adaptive attention methods significantly reduce the memory requirements of Transformer-based language models, they are not tailored for LVLMs. We observe that LVLMs generate responses from both remote image tokens and local text tokens, and that the two modalities exhibit different attention patterns. This observation inspires us to manage attention for each modality separately. Specifically, for visual input, we store the cache of potentially useful information but compute only the most critical parts. For language input, we focus on local information. Based on our observation and analysis of vision-language attention patterns, we develop A-VL, a plug-and-play adaptive attention method tailored for LVLM inference. Extensive evaluations on three vision-language tasks and five datasets show the effectiveness of our designs. A-VL outperforms existing adaptive attention methods in reducing memory usage and computational load without compromising performance.
format Preprint
id arxiv_https___arxiv_org_abs_2409_14846
institution arXiv
publishDate 2024
record_format arxiv
title A-VL: Adaptive Attention for Large Vision-Language Models
topic Artificial Intelligence
Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2409.14846