Internformat: :: Library Catalog

Gespeichert in:

Bibliographische Detailangaben
Hauptverfasser:	Deng, Yuchen, Cai, Zidang, Zheng, Hai-Tao, Wang, Jie, Yang, Feidiao, Han, Yuxing
Format:	Preprint
Veröffentlicht:	2026
Schlagworte:	Artificial Intelligence
Online-Zugang:	https://arxiv.org/abs/2605.12056
Tags:	Tag hinzufügen Keine Tags, Fügen Sie den ersten Tag hinzu!

_version_	1866918496970997760
author	Deng, Yuchen Cai, Zidang Zheng, Hai-Tao Wang, Jie Yang, Feidiao Han, Yuxing
author_facet	Deng, Yuchen Cai, Zidang Zheng, Hai-Tao Wang, Jie Yang, Feidiao Han, Yuxing
contents	Omnimodal large language models (Omni-LLMs) show strong capability in audio-video understanding, but their practical deployment remains limited by high inference cost of long video streams and dense audio sequences. Despite recent progress, existing compression methods for Omni-LLMs typically rely on fixed or native compression units, which can disrupt cross-modal correspondence and the complementary information required for audio-video reasoning, making it difficult to improve inference efficiency while stably preserving performance. To address this, we propose OmniRefine, a training-free two-stage framework for efficient audio-visual token compression in Omni-LLMs. First, Correspondence-Preserving Chunk Refinement refines native chunk boundaries into cross-modally aligned compression units through frame-audio similarity and dynamic programming. Second, Modality-Aware Cooperative Compression jointly compresses video and audio tokens within each refined unit to reduce redundancy while preserving critical evidence. Extensive experiments show that OmniRefine achieves a better efficiency-performance trade-off than strong baselines and maintains stable performance under lower compression ratios. On WorldSense, it still reaches 46.7% accuracy at a 44% token retention ratio, nearly matching the full-token baseline. The code and interface will be released to facilitate further research.
format	Preprint
id	arxiv_https___arxiv_org_abs_2605_12056
institution	arXiv
publishDate	2026
record_format	arxiv
spellingShingle	OmniRefine: Alignment-Aware Cooperative Compression for Efficient Omnimodal Large Language Models Deng, Yuchen Cai, Zidang Zheng, Hai-Tao Wang, Jie Yang, Feidiao Han, Yuxing Artificial Intelligence Omnimodal large language models (Omni-LLMs) show strong capability in audio-video understanding, but their practical deployment remains limited by high inference cost of long video streams and dense audio sequences. Despite recent progress, existing compression methods for Omni-LLMs typically rely on fixed or native compression units, which can disrupt cross-modal correspondence and the complementary information required for audio-video reasoning, making it difficult to improve inference efficiency while stably preserving performance. To address this, we propose OmniRefine, a training-free two-stage framework for efficient audio-visual token compression in Omni-LLMs. First, Correspondence-Preserving Chunk Refinement refines native chunk boundaries into cross-modally aligned compression units through frame-audio similarity and dynamic programming. Second, Modality-Aware Cooperative Compression jointly compresses video and audio tokens within each refined unit to reduce redundancy while preserving critical evidence. Extensive experiments show that OmniRefine achieves a better efficiency-performance trade-off than strong baselines and maintains stable performance under lower compression ratios. On WorldSense, it still reaches 46.7% accuracy at a 44% token retention ratio, nearly matching the full-token baseline. The code and interface will be released to facilitate further research.
title	OmniRefine: Alignment-Aware Cooperative Compression for Efficient Omnimodal Large Language Models
topic	Artificial Intelligence
url	https://arxiv.org/abs/2605.12056

Ähnliche Einträge