Bibliographic Details
Main Authors: Chen, Junjie, Liu, Xuyang, Wen, Zichen, Wang, Yiyu, Huang, Siteng, Chen, Honggang
Format: Preprint
Published: 2025
Online Access: https://arxiv.org/abs/2509.01552
_version_ 1866917293013860352
author Chen, Junjie
Liu, Xuyang
Wen, Zichen
Wang, Yiyu
Huang, Siteng
Chen, Honggang
contents Large vision-language models (LVLMs) have demonstrated remarkable capabilities in multimodal understanding tasks. However, the growing demand for high-resolution image and long-video understanding produces very large token counts, which reduces inference efficiency. Token compression offers a direct solution: it reduces the number of tokens to be processed and thereby improves computational efficiency without architectural changes. Through extensive analysis, we identify two critical limitations in existing inner-LLM token compression methods, positional bias and incompatibility with efficient operators, which hinder their practical deployment for LVLM acceleration. This paper presents the first approach from a dynamic token variation perspective, revealing that visual token variations within LLMs exhibit task-agnostic properties. We propose Variation-aware Vision Token Dropping (V²Drop), which progressively removes visual tokens with minimal variation during LVLM inference, thereby enhancing computational efficiency. Extensive experiments across multiple models and benchmarks consistently demonstrate that V²Drop maintains 94.0% and 98.6% of the original performance for image and video understanding tasks, respectively, while reducing LLM generation latency by 31.5% and 74.2%.
format Preprint
id arxiv_https___arxiv_org_abs_2509_01552
institution arXiv
publishDate 2025
record_format arxiv
title Variation-aware Vision Token Dropping for Faster Large Vision-Language Models
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2509.01552
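
The abstract in this record describes dropping visual tokens whose hidden states vary least during inference. The sketch below illustrates that idea for a single layer transition. It is not the authors' implementation: the function name, the `keep_ratio` parameter, and the use of the L2 norm of the per-token change as the variation score are all assumptions made for illustration.

```python
import numpy as np

def drop_low_variation_tokens(prev_hidden, curr_hidden, keep_ratio):
    """Keep the visual tokens whose hidden states changed most between two layers.

    prev_hidden, curr_hidden: arrays of shape (num_tokens, hidden_dim).
    keep_ratio: fraction of tokens to retain, in (0, 1].
    Returns (kept_hidden, kept_indices), with indices in original token order.
    """
    # Assumed variation score: L2 norm of each token's change across the layer.
    variation = np.linalg.norm(curr_hidden - prev_hidden, axis=1)
    num_keep = max(1, int(round(keep_ratio * len(variation))))
    # Pick the num_keep highest-variation tokens, then restore original order.
    top = np.argsort(-variation, kind="stable")[:num_keep]
    kept_indices = np.sort(top)
    return curr_hidden[kept_indices], kept_indices


# Toy example: 4 tokens with 2-dimensional hidden states.
prev = np.zeros((4, 2))
curr = np.array([[0.0, 0.0], [3.0, 4.0], [1.0, 0.0], [0.0, 2.0]])
kept_hidden, kept_indices = drop_low_variation_tokens(prev, curr, keep_ratio=0.5)
```

Under these assumptions, a progressive scheme would call this function at several chosen layers, each time passing only the tokens that survived the previous call. Because the score depends only on hidden states, it does not require access to attention maps.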