Bibliographic Details
Main Authors:	Yang, Chih-Kai; Tsai, Yun-Shao; Guo, Yu-Kai; Tsai, Ping-Le; Piao, Yen-Ting; Chen, Hung-Wei; Hsiao, Ting-Lin; Hsu, Yun-Man; Lu, Ke-Han; Lee, Hung-yi
Format:	Preprint
Published:	2026
Subjects:	Sound; Artificial Intelligence; Computation and Language; Audio and Speech Processing
Online Access:	https://arxiv.org/abs/2603.09714

Abstract:

Multi-audio understanding is critical for large audio-language models (LALMs), yet it remains underexplored. We introduce MUGEN, a comprehensive benchmark that evaluates this capability across speech, general audio, and music. Our experiments reveal consistent weaknesses in multi-audio settings: performance degrades sharply as the number of concurrent audio inputs increases, which identifies input scaling as a fundamental bottleneck. We further investigate training-free strategies and observe that Audio-Permutational Self-Consistency, which diversifies the order of audio candidates, helps models form more robust aggregated predictions, yielding accuracy gains of up to 6.28%. Combining this permutation strategy with Chain-of-Thought raises the maximum gain to 6.74%. These results expose blind spots in current LALMs and provide a foundation for evaluating complex auditory comprehension.
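
The abstract describes Audio-Permutational Self-Consistency only at a high level, so the following is a minimal sketch of one way the idea could work, not the authors' implementation. It assumes a hypothetical `predict` callable that stands in for an LALM: it receives the audio candidates in some order and returns the position of the candidate it selects. The function name, the number of permutations, and the majority-vote aggregation are all assumptions made for illustration.

```python
import random
from collections import Counter
from typing import Callable, Sequence


def permutational_self_consistency(
    candidates: Sequence[str],
    predict: Callable[[Sequence[str]], int],
    num_permutations: int = 8,
    seed: int = 0,
) -> int:
    """Return the original index of the candidate chosen most often.

    `predict` is a hypothetical stand-in for a model call: given the
    candidates in a particular order, it returns the position it picks.
    """
    rng = random.Random(seed)
    votes: Counter = Counter()
    original_indices = list(range(len(candidates)))

    for _ in range(num_permutations):
        # Present the same candidates in a freshly shuffled order.
        order = original_indices[:]
        rng.shuffle(order)
        shuffled = [candidates[i] for i in order]

        # Map the picked position back to the original index before voting,
        # so votes from different orderings are comparable.
        picked_position = predict(shuffled)
        votes[order[picked_position]] += 1

    # Aggregate by majority vote (ties resolve to the first answer counted).
    return votes.most_common(1)[0][0]
```

With a stand-in predictor that always locates one particular clip, for example `lambda clips: list(clips).index("b.wav")` over `["a.wav", "b.wav", "c.wav"]`, every permutation votes for the same original index. A real model's choices may vary with presentation order, and that variation is what the vote is meant to average out.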

Similar Items