Staff View: Clapper: Compact Learning and Video Representation in VLMs

Bibliographic Details
Main Authors:	Kong, Lingyu; Zhang, Hongzhi; Zhang, Jingyuan; Huang, Jianzhao; Li, Kunze; Wang, Qi; Zhang, Fuzheng
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2505.15529

_version_	1866916749484490752
author	Kong, Lingyu; Zhang, Hongzhi; Zhang, Jingyuan; Huang, Jianzhao; Li, Kunze; Wang, Qi; Zhang, Fuzheng
author_facet	Kong, Lingyu; Zhang, Hongzhi; Zhang, Jingyuan; Huang, Jianzhao; Li, Kunze; Wang, Qi; Zhang, Fuzheng
contents	Current vision-language models (VLMs) have demonstrated remarkable capabilities across diverse video understanding applications. Designing VLMs for video inputs requires effectively modeling the temporal dimension (i.e., capturing dependencies across frames) and balancing the processing of short and long videos. Specifically, short videos demand preservation of fine-grained details, whereas long videos require strategic compression of visual information to handle extensive temporal contexts efficiently. However, our empirical analysis reveals a critical limitation: most existing VLMs suffer severe performance degradation in long video understanding tasks when visual tokens are compressed to below a quarter of their original number. To enable more effective modeling of both short and long video inputs, we propose Clapper, a method that utilizes a slow-fast strategy for video representation and introduces a novel module named TimePerceiver for efficient temporal-spatial encoding within existing VLM backbones. With our method, we achieve 13x compression of visual tokens per frame (averaging 61 tokens/frame) without compromising QA accuracy. In our experiments, Clapper achieves 62.0% on VideoMME, 69.8% on MLVU, and 67.4% on TempCompass, all with fewer than 6,000 visual tokens per video. The code will be publicly available on the homepage.
format	Preprint
id	arxiv_https___arxiv_org_abs_2505_15529
institution	arXiv
publishDate	2025
record_format	arxiv
title	Clapper: Compact Learning and Video Representation in VLMs
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2505.15529
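
Note on the token figures in the abstract: the record gives a 13x per-frame compression, an average of 61 tokens per frame, and a budget of fewer than 6,000 visual tokens per video. The Python sketch below only works through the arithmetic these three numbers imply. The function names are hypothetical, and the derived values (uncompressed tokens per frame, frame count) are back-of-envelope estimates that are not stated in the record. They also assume every frame costs the average, which the slow-fast strategy described in the abstract likely does not do.

```python
# Figures quoted in the abstract of this record.
TOKENS_PER_FRAME = 61    # average visual tokens per frame after compression
COMPRESSION_RATIO = 13   # reported per-frame compression factor
TOKEN_BUDGET = 6000      # reported upper bound on visual tokens per video


def implied_uncompressed_tokens_per_frame(
    compressed: int = TOKENS_PER_FRAME, ratio: int = COMPRESSION_RATIO
) -> int:
    """Estimate the pre-compression token count per frame.

    Assumption: the 13x factor applies to the 61-token average.
    The record itself does not state this number.
    """
    return compressed * ratio


def max_frames_within_budget(
    budget: int = TOKEN_BUDGET, per_frame: int = TOKENS_PER_FRAME
) -> int:
    """Estimate how many average-cost frames fit in the token budget.

    Assumption: every frame costs exactly the average.
    """
    return budget // per_frame


if __name__ == "__main__":
    print(implied_uncompressed_tokens_per_frame())  # 793
    print(max_frames_within_budget())               # 98
```

Under these assumptions, roughly 98 frames at 61 tokens each (5,978 tokens) stay within the stated budget of fewer than 6,000 visual tokens.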
