Bibliographic Details
Main Authors: Su, Zunhai; Shen, Wang; Li, Linge; Chen, Zhe; Wei, Hanyu; Yu, Huangqi; Yuan, Kehong
Format: Preprint
Published: 2025
Subjects: Computation and Language
Online Access: https://arxiv.org/abs/2501.15021
_version_ 1866913665303707648
author Su, Zunhai
Shen, Wang
Li, Linge
Chen, Zhe
Wei, Hanyu
Yu, Huangqi
Yuan, Kehong
contents Vision-language models (VLMs) show remarkable performance in multimodal tasks. However, excessively long multimodal inputs lead to oversized Key-Value (KV) caches, resulting in significant memory consumption and I/O bottlenecks. Previous KV quantization methods for Large Language Models (LLMs) may alleviate these issues but overlook the attention saliency differences among multimodal tokens, resulting in suboptimal performance. In this paper, we investigate the attention-aware token saliency patterns in VLMs and propose AKVQ-VL. AKVQ-VL leverages the proposed Text-Salient Attention (TSA) and Pivot-Token-Salient Attention (PSA) patterns to adaptively allocate bit budgets. Moreover, achieving extremely low-bit quantization requires effectively addressing outliers in KV tensors. AKVQ-VL uses the Walsh-Hadamard transform (WHT) to construct outlier-free KV caches, thereby reducing quantization difficulty. Evaluations of 2-bit quantization on 12 long-context and multimodal tasks demonstrate that AKVQ-VL maintains or even improves accuracy, outperforming LLM-oriented methods. AKVQ-VL reduces peak memory usage by 2.13x and supports up to 3.25x larger batch sizes and 2.46x higher throughput.
format Preprint
id arxiv_https___arxiv_org_abs_2501_15021
institution arXiv
publishDate 2025
record_format arxiv
title AKVQ-VL: Attention-Aware KV Cache Adaptive 2-Bit Quantization for Vision-Language Models
topic Computation and Language
url https://arxiv.org/abs/2501.15021
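
The abstract states that a Walsh-Hadamard transform is applied to KV tensors to suppress outliers before 2-bit quantization. The following is a minimal sketch of that general idea only, not the authors' implementation. The function names (fwht, quantize_2bit, dequantize), the per-vector min-max uniform quantizer, and the synthetic tensor with one outlier channel are all assumptions made for illustration. The attention-aware bit allocation (TSA and PSA) described in the abstract is not modeled here.

```python
import numpy as np


def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform along the last axis.

    The last dimension must be a power of two. With the 1/sqrt(n) scaling
    the transform is its own inverse.
    """
    x = np.asarray(x, dtype=np.float64).copy()
    n = x.shape[-1]
    if n == 0 or n & (n - 1) != 0:
        raise ValueError("last dimension must be a power of two")
    lead = x.shape[:-1]
    h = 1
    while h < n:
        # Butterfly step: pair up adjacent blocks of size h.
        y = x.reshape(*lead, n // (2 * h), 2, h)
        a = y[..., 0, :] + y[..., 1, :]
        b = y[..., 0, :] - y[..., 1, :]
        x = np.stack([a, b], axis=-2).reshape(*lead, n)
        h *= 2
    return x / np.sqrt(n)


def quantize_2bit(x):
    """Uniform min-max quantization to 4 levels per vector (assumed scheme)."""
    lo = x.min(axis=-1, keepdims=True)
    hi = x.max(axis=-1, keepdims=True)
    scale = (hi - lo) / 3.0
    scale = np.where(scale == 0, 1.0, scale)  # guard against constant vectors
    codes = np.clip(np.round((x - lo) / scale), 0, 3).astype(np.uint8)
    return codes, scale, lo


def dequantize(codes, scale, lo):
    """Map 2-bit codes back to floating-point values."""
    return codes * scale + lo


# Synthetic stand-in for key vectors: Gaussian noise plus one outlier channel.
rng = np.random.default_rng(0)
k = rng.normal(size=(4, 64))
k[:, 5] += 20.0

# Baseline: quantize the raw tensor directly.
k_direct = dequantize(*quantize_2bit(k))
err_direct = np.mean((k - k_direct) ** 2)

# Rotate with the WHT, quantize, dequantize, then rotate back.
k_rot = fwht(k)
codes, scale, lo = quantize_2bit(k_rot)
k_wht = fwht(dequantize(codes, scale, lo))
err_wht = np.mean((k - k_wht) ** 2)
```

In this toy setting the transform spreads the energy of the outlier channel across all coordinates, so the quantization range shrinks and the reconstruction error after the round trip is lower than for direct quantization. Comparing err_direct with err_wht, or np.abs(k).max() with np.abs(k_rot).max(), shows the effect.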