Bibliographic Details
Main Authors: Li, Xinqing; He, Xin; Zhang, Xindong; Cheng, Ming-Ming; Zhang, Lei; Liu, Yun
Format: Preprint
Published: 2026
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2604.17320
author Li, Xinqing
He, Xin
Zhang, Xindong
Cheng, Ming-Ming
Zhang, Lei
Liu, Yun
contents Deploying Vision-Language Models (VLMs) under aggressive low-bit inference remains challenging because inference cost is dominated by the long visual-token prefix during prefill and the growing KV cache during autoregressive decoding. Token pruning and low-bit quantization are complementary for reducing these costs, yet naive stage-wise combinations are often brittle due to a mismatch between quantization calibration and pruning execution. We present a collaborative quantization-and-pruning framework that unifies low-bit inference and deterministic visual-token pruning in a single deployable pipeline. The framework introduces the Quantization Unified Offline Token Allocator (QUOTA), which converts low-bit calibration signals into a layer-wise token allocation schedule and materializes it as a pruning recipe. Token importance is evaluated under deployed W4A4 operators with a quantized KV cache by combining activation magnitude, attention cues, and an explicit low-bit risk signal, enabling consistent budgeted top-k selection. Experiments on standard VLM benchmarks show improved robustness over stage-wise baselines under the same low-bit regime, achieving 95.65% average retention while retaining only 30% of visual tokens, compared with about 94.3% retention for representative stage-wise combinations. The code will be released.
format Preprint
id arxiv_https___arxiv_org_abs_2604_17320
institution arXiv
publishDate 2026
record_format arxiv
title Towards Joint Quantization and Token Pruning of Vision-Language Models
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2604.17320
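The abstract describes budgeted top-k selection of visual tokens from a score that combines activation magnitude, attention cues, and a low-bit risk signal. The sketch below illustrates that general idea only and is not the authors' QUOTA implementation. The function names, the min-max normalization, the linear weighting, and the choice to penalize high-risk tokens (negative weight) are all assumptions made for illustration; the record does not specify any of them.

```python
from typing import List, Sequence, Tuple


def min_max_normalize(values: Sequence[float]) -> List[float]:
    """Rescale values to [0, 1]; a constant input maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def select_tokens(
    activation: Sequence[float],
    attention: Sequence[float],
    risk: Sequence[float],
    keep_ratio: float = 0.3,
    weights: Tuple[float, float, float] = (1.0, 1.0, -1.0),
) -> List[int]:
    """Return the indices of the tokens kept under a fixed budget.

    Hypothetical scoring rule: a weighted sum of the three normalized
    cues. The sign and size of each weight are assumptions, not values
    taken from the paper.
    """
    n = len(activation)
    if len(attention) != n or len(risk) != n:
        raise ValueError("each cue needs exactly one value per token")
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    if n == 0:
        return []

    # Budget: keep a fixed fraction of the tokens, but at least one.
    k = max(1, round(keep_ratio * n))

    cues = [min_max_normalize(c) for c in (activation, attention, risk)]
    scores = [sum(w * cue[i] for w, cue in zip(weights, cues)) for i in range(n)]

    # Sort by descending score, breaking ties by token index so that the
    # same inputs always yield the same selection.
    ranked = sorted(range(n), key=lambda i: (-scores[i], i))

    # Return kept indices in their original sequence order.
    return sorted(ranked[:k])
```

With ten tokens and `keep_ratio=0.3`, the function keeps three indices. Breaking ties by token index is one simple way to make the selection deterministic, which is the property the abstract emphasizes for pruning.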