Staff View: EchoingPixels: Cross-Modal Adaptive Token Reduction for Efficient Audio-Visual LLMs :: Library Catalog

Bibliographic Details
Main Authors:	Gong, Chao; Wang, Depeng; Wei, Zhipeng; Guo, Ya; Zhu, Huijia; Chen, Jingjing
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2512.10324

_version_	1866918243378135040
author	Gong, Chao; Wang, Depeng; Wei, Zhipeng; Guo, Ya; Zhu, Huijia; Chen, Jingjing
author_facet	Gong, Chao; Wang, Depeng; Wei, Zhipeng; Guo, Ya; Zhu, Huijia; Chen, Jingjing
contents	Audio-Visual Large Language Models (AV-LLMs) face prohibitive computational overhead from massive audio and video tokens. Token reduction, while extensively explored for video-only LLMs, is insufficient for the audio-visual domain, as these unimodal methods cannot leverage audio-visual cross-modal synergies. Furthermore, the distinct and dynamic information densities of audio and video render static budgets per modality suboptimal. How to perform token reduction on a joint audio-visual stream thus remains an unaddressed bottleneck. To fill this gap, we introduce EchoingPixels, a framework inspired by the coexistence and interaction of visuals and sound in real-world scenes. The core of our framework is the Cross-Modal Semantic Sieve (CS2), a module enabling early audio-visual interaction. Instead of compressing modalities independently, CS2 co-attends to the joint multimodal stream and reduces tokens from an entire combined pool of audio-visual tokens rather than using fixed budgets per modality. This single-pool approach allows it to adaptively allocate the token budget across both modalities and dynamically identify salient tokens in concert. To ensure this aggressive reduction preserves the vital temporal modeling capability, we co-design a Synchronization-Augmented RoPE (Sync-RoPE) to maintain critical temporal relationships for the sparsely selected tokens. Extensive experiments demonstrate that EchoingPixels achieves performance comparable to strong baselines using only 5-20% of the original tokens, with a 2-3x speedup and memory reduction.
format	Preprint
id	arxiv_https___arxiv_org_abs_2512_10324
institution	arXiv
publishDate	2025
record_format	arxiv
title	EchoingPixels: Cross-Modal Adaptive Token Reduction for Efficient Audio-Visual LLMs
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2512.10324
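
The contents field above contrasts fixed per-modality token budgets with selection from a single combined audio-visual pool. The following is a minimal sketch of that contrast only, not the authors' CS2 module: the saliency scores are made-up numbers (the abstract does not say how CS2 scores tokens beyond co-attending to the joint stream), and the function names `select_fixed_budget` and `select_joint_pool` and the `keep_ratio` parameter are hypothetical. Kept tokens are returned as their original indices, so their temporal order is preserved.

```python
import math


def select_fixed_budget(audio_scores, video_scores, keep_ratio):
    """Baseline: each modality keeps its own top `keep_ratio` share."""
    def top_k(scores):
        k = max(1, math.ceil(len(scores) * keep_ratio))
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        return sorted(ranked[:k])  # restore original (temporal) order
    return top_k(audio_scores), top_k(video_scores)


def select_joint_pool(audio_scores, video_scores, keep_ratio):
    """Single pool: rank all tokens together and keep the global top share."""
    pool = [("audio", i, s) for i, s in enumerate(audio_scores)]
    pool += [("video", i, s) for i, s in enumerate(video_scores)]
    k = max(1, math.ceil(len(pool) * keep_ratio))
    kept = sorted(pool, key=lambda t: t[2], reverse=True)[:k]
    audio_kept = sorted(i for m, i, _ in kept if m == "audio")
    video_kept = sorted(i for m, i, _ in kept if m == "video")
    return audio_kept, video_kept


# Hypothetical saliency scores: the audio track is dense with information,
# while most of the video tokens are redundant.
audio = [0.9, 0.8, 0.7, 0.1]
video = [0.2, 0.1, 0.6, 0.05, 0.3, 0.1]

fixed = select_fixed_budget(audio, video, keep_ratio=0.5)
joint = select_joint_pool(audio, video, keep_ratio=0.5)
print("fixed budgets:", fixed)
print("joint pool:   ", joint)
```

Both strategies keep five of the ten tokens. With these scores the fixed split keeps two audio and three video tokens, while the joint pool keeps three audio and two video tokens, shifting one slot of the budget toward the denser modality.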
