Bibliographic Details
Main Authors: Kong, Lingyu, Zhang, Hongzhi, Zhang, Jingyuan, Huang, Jianzhao, Li, Kunze, Wang, Qi, Zhang, Fuzheng
Format: Preprint
Published: 2025
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2505.15529
_version_ 1866916749484490752
author Kong, Lingyu
Zhang, Hongzhi
Zhang, Jingyuan
Huang, Jianzhao
Li, Kunze
Wang, Qi
Zhang, Fuzheng
author_facet Kong, Lingyu
Zhang, Hongzhi
Zhang, Jingyuan
Huang, Jianzhao
Li, Kunze
Wang, Qi
Zhang, Fuzheng
contents Current vision-language models (VLMs) have demonstrated remarkable capabilities across diverse video understanding applications. Designing VLMs for video inputs requires effectively modeling the temporal dimension (i.e., capturing dependencies across frames) and balancing the processing of short and long videos. Specifically, short videos demand preservation of fine-grained details, whereas long videos require strategic compression of visual information to handle extensive temporal contexts efficiently. However, our empirical analysis reveals a critical limitation: most existing VLMs suffer severe performance degradation on long video understanding tasks when visual tokens are compressed to below a quarter of their original number. To enable more effective modeling of both short and long video inputs, we propose Clapper, a method that uses a slow-fast strategy for video representation and introduces a novel module named TimePerceiver for efficient temporal-spatial encoding within existing VLM backbones. With our method, we achieve 13x compression of visual tokens per frame (averaging 61 tokens/frame) without compromising QA accuracy. In our experiments, Clapper achieves 62.0% on VideoMME, 69.8% on MLVU, and 67.4% on TempCompass, all with fewer than 6,000 visual tokens per video. The code will be publicly available on the homepage.
format Preprint
id arxiv_https___arxiv_org_abs_2505_15529
institution arXiv
publishDate 2025
record_format arxiv
spellingShingle Clapper: Compact Learning and Video Representation in VLMs
Kong, Lingyu
Zhang, Hongzhi
Zhang, Jingyuan
Huang, Jianzhao
Li, Kunze
Wang, Qi
Zhang, Fuzheng
Computer Vision and Pattern Recognition
Current vision-language models (VLMs) have demonstrated remarkable capabilities across diverse video understanding applications. Designing VLMs for video inputs requires effectively modeling the temporal dimension (i.e., capturing dependencies across frames) and balancing the processing of short and long videos. Specifically, short videos demand preservation of fine-grained details, whereas long videos require strategic compression of visual information to handle extensive temporal contexts efficiently. However, our empirical analysis reveals a critical limitation: most existing VLMs suffer severe performance degradation on long video understanding tasks when visual tokens are compressed to below a quarter of their original number. To enable more effective modeling of both short and long video inputs, we propose Clapper, a method that uses a slow-fast strategy for video representation and introduces a novel module named TimePerceiver for efficient temporal-spatial encoding within existing VLM backbones. With our method, we achieve 13x compression of visual tokens per frame (averaging 61 tokens/frame) without compromising QA accuracy. In our experiments, Clapper achieves 62.0% on VideoMME, 69.8% on MLVU, and 67.4% on TempCompass, all with fewer than 6,000 visual tokens per video. The code will be publicly available on the homepage.
title Clapper: Compact Learning and Video Representation in VLMs
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2505.15529