Staff View: Variation-aware Vision Token Dropping for Faster Large Vision-Language Models :: Library Catalog

Bibliographic Details
Main Authors:	Chen, Junjie; Liu, Xuyang; Wen, Zichen; Wang, Yiyu; Huang, Siteng; Chen, Honggang
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2509.01552

_version_	1866917293013860352
author	Chen, Junjie; Liu, Xuyang; Wen, Zichen; Wang, Yiyu; Huang, Siteng; Chen, Honggang
author_facet	Chen, Junjie; Liu, Xuyang; Wen, Zichen; Wang, Yiyu; Huang, Siteng; Chen, Honggang
contents	Large vision-language models (LVLMs) have demonstrated remarkable capabilities in multimodal understanding tasks. However, the increasing demand for high-resolution image and long-video understanding results in substantial token counts, consequently leading to reduced inference efficiency. Token compression offers a direct solution by reducing the number of tokens to be processed, thereby improving computational efficiency without architectural changes. Through extensive analysis, we identify two critical limitations in existing inner-LLM token compression methods: positional bias and incompatibility with efficient operators, which critically hinder their practical deployment for LVLM acceleration. This paper presents the first approach from a dynamic token variation perspective, revealing that visual token variations within LLMs exhibit task-agnostic properties. We propose Variation-aware Vision Token Dropping (i.e., V²Drop), which progressively removes visual tokens with minimal variation during LVLM inference, thereby enhancing computational efficiency. Extensive experiments across multiple models and benchmarks consistently demonstrate that V²Drop maintains 94.0% and 98.6% of the original performance for image and video understanding tasks respectively, while reducing LLM generation latency by 31.5% and 74.2%.
format	Preprint
id	arxiv_https___arxiv_org_abs_2509_01552
institution	arXiv
publishDate	2025
record_format	arxiv
title	Variation-aware Vision Token Dropping for Faster Large Vision-Language Models
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2509.01552