Staff View: Towards Joint Quantization and Token Pruning of Vision-Language Models :: Library Catalog

Bibliographic Details
Main Authors:	Li, Xinqing; He, Xin; Zhang, Xindong; Cheng, Ming-Ming; Zhang, Lei; Liu, Yun
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2604.17320

_version_	1866917420318326784
author	Li, Xinqing; He, Xin; Zhang, Xindong; Cheng, Ming-Ming; Zhang, Lei; Liu, Yun
author_facet	Li, Xinqing; He, Xin; Zhang, Xindong; Cheng, Ming-Ming; Zhang, Lei; Liu, Yun
contents	Deploying Vision-Language Models (VLMs) under aggressive low-bit inference remains challenging because inference cost is dominated by the long visual-token prefix during prefill and the growing KV cache during autoregressive decoding. Token pruning and low-bit quantization are complementary for reducing these costs, yet naive stage-wise combinations are often brittle due to a mismatch between quantization calibration and pruning execution. We present a collaborative quantization-and-pruning framework that unifies low-bit inference and deterministic visual-token pruning in a single deployable pipeline. The framework introduces the Quantization Unified Offline Token Allocator (QUOTA), which converts low-bit calibration signals into a layer-wise token allocation schedule and materializes it as a pruning recipe. Token importance is evaluated under deployed W4A4 operators with a quantized KV cache by combining activation magnitude, attention cues, and an explicit low-bit risk signal, enabling consistent budgeted top-k selection. Experiments on standard VLM benchmarks show improved robustness over stage-wise baselines under the same low-bit regime, achieving 95.65% average retention while retaining only 30% of visual tokens, compared with about 94.3% retention for representative stage-wise combinations. The code will be released.
format	Preprint
id	arxiv_https___arxiv_org_abs_2604_17320
institution	arXiv
publishDate	2026
record_format	arxiv
title	Towards Joint Quantization and Token Pruning of Vision-Language Models
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2604.17320
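
Note on the contents field: the abstract describes budgeted top-k selection of visual tokens from three signals (activation magnitude, attention cues, and a low-bit risk signal). The sketch below is only an illustration of that idea and is not the authors' QUOTA implementation, which the record says is not yet released. The function name select_visual_tokens, the min-max normalization, the equal weights, and the choice to subtract the risk term are all assumptions made for this example; the abstract does not say how the signals are combined.

```python
import numpy as np


def select_visual_tokens(act_mag, attn, risk, keep_ratio=0.3, weights=(1.0, 1.0, 1.0)):
    """Return sorted indices of the visual tokens kept under a fixed budget.

    Hypothetical scoring rule (an assumption, not taken from the paper):
    score = w_act * act + w_attn * attn - w_risk * risk, each signal
    min-max normalized to [0, 1] first.
    """

    def normalize(x):
        x = np.asarray(x, dtype=np.float64)
        span = x.max() - x.min()
        # A constant signal carries no ranking information, so map it to zeros.
        return (x - x.min()) / span if span > 0 else np.zeros_like(x)

    w_act, w_attn, w_risk = weights
    score = (
        w_act * normalize(act_mag)
        + w_attn * normalize(attn)
        - w_risk * normalize(risk)  # assumption: risky tokens are penalized
    )

    # Budget: keep a fixed fraction of tokens, at least one.
    k = max(1, int(round(keep_ratio * score.size)))

    # Stable sort on negated scores breaks ties by lower index,
    # so the same inputs always give the same kept set.
    order = np.argsort(-score, kind="stable")
    return np.sort(order[:k])


if __name__ == "__main__":
    signal = [0.1, 0.9, 0.5, 0.3, 0.8, 0.2, 0.7, 0.4, 0.6, 0.0]
    kept = select_visual_tokens(signal, signal, np.zeros(10), keep_ratio=0.3)
    print(kept.tolist())
```

With keep_ratio=0.3 the function keeps 3 of 10 tokens, mirroring the 30% retention figure quoted in the abstract. A layer-wise schedule, as the abstract describes, could be approximated by calling the function once per layer with a different keep_ratio; how QUOTA derives those per-layer budgets is not stated in this record.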

Similar Items