| Main Authors: | Xu, Jilan, Thomé, Carl, Horak, Danijela, Xie, Weidi, Zisserman, Andrew |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2602.18010 |
Similar Items
Synchformer: Efficient Synchronization from Sparse Cues
by: Iashin, Vladimir, et al.
Published: (2024)
SpotSound: Enhancing Large Audio-Language Models with Fine-Grained Temporal Grounding
by: Sun, Luoyi, et al.
Published: (2026)
Auto-ACD: A Large-scale Dataset for Audio-Language Representation Learning
by: Sun, Luoyi, et al.
Published: (2023)
Resonate: Reinforcing Text-to-Audio Generation via Online Feedback from Large Audio Language Models
by: Li, Xiquan, et al.
Published: (2026)
Look, Listen and Recognise: Character-Aware Audio-Visual Subtitling
by: Korbar, Bruno, et al.
Published: (2024)
AudioRAG+: Feedback-driven Retrieval-augmented Audio Generation with Large Audio Language Models
by: Zhao, Junqi, et al.
Published: (2025)
Omni-Embed-Audio: Leveraging Multimodal LLMs for Robust Audio-Text Retrieval
by: Yoo, HaeJun, et al.
Published: (2026)
Causal Tracing of Audio-Text Fusion in Large Audio Language Models
by: Chen, Wei-Chih, et al.
Published: (2026)
MRSAudio: A Large-Scale Multimodal Recorded Spatial Audio Dataset with Refined Annotations
by: Guo, Wenxiang, et al.
Published: (2025)
Bridging Language Gaps in Audio-Text Retrieval
by: Yan, Zhiyong, et al.
Published: (2024)
TW-Sound580K: A Regional Audio-Text Dataset with Verification-Guided Curation for Localized Audio-Language Modeling
by: Xie, Hao-Hui, et al.
Published: (2026)
When Silence Matters: The Impact of Irrelevant Audio on Text Reasoning in Large Audio-Language Models
by: Li, Chen-An, et al.
Published: (2025)
Thinking with Sound: Audio Chain-of-Thought Enables Multimodal Reasoning in Large Audio-Language Models
by: Xiong, Zhen, et al.
Published: (2025)
Bypassing Direct Reconstruction: Speech Detection from MEG via Large-Scale Audio Retrieval
by: Xiao, Boda, et al.
Published: (2026)
A SOUND APPROACH: Using Large Language Models to generate audio descriptions for egocentric text-audio retrieval
by: Oncescu, Andreea-Maria, et al.
Published: (2024)
UniAudio 2.0: A Unified Audio Language Model with Text-Aligned Factorized Audio Tokenization
by: Yang, Dongchao, et al.
Published: (2026)
StyleBreak: Revealing Alignment Vulnerabilities in Large Audio-Language Models via Style-Aware Audio Jailbreak
by: Li, Hongyi, et al.
Published: (2025)
Audio Hallucination Attacks: Probing the Reliability of Large Audio Language Models
by: Seth, Ashish, et al.
Published: (2026)
ChronosAudio: A Comprehensive Long-Audio Benchmark for Evaluating Audio-Large Language Models
by: Luo, Kaiwen, et al.
Published: (2026)
ATIR: Towards Audio-Text Interleaved Contextual Retrieval
by: Zhao, Tong, et al.
Published: (2026)
AudioKV: KV Cache Eviction in Efficient Large Audio Language Models
by: Wang, Yuxuan, et al.
Published: (2026)
Character-aware audio-visual subtitling in context
by: Huh, Jaesung, et al.
Published: (2024)
Reading to Listen at the Cocktail Party: Multi-Modal Speech Separation
by: Rahimi, Akam, et al.
Published: (2025)
A Benchmark for Audio Reasoning Capabilities of Multimodal Large Language Models
by: Christop, Iwona, et al.
Published: (2026)
Retrieval-Augmented Text-to-Audio Generation
by: Yuan, Yi, et al.
Published: (2023)
Semantically consistent Video-to-Audio Generation using Multimodal Language Large Model
by: Chen, Gehui, et al.
Published: (2024)
PicoAudio2: Temporal Controllable Text-to-Audio Generation with Natural Language Description
by: Zheng, Zihao, et al.
Published: (2025)
Refining Knowledge Transfer on Audio-Image Temporal Agreement for Audio-Text Cross Retrieval
by: Tsubaki, Shunsuke, et al.
Published: (2024)
From Contrast to Commonality: Audio Commonality Captioning for Enhanced Audio-Text Cross-modal Understanding in Multimodal LLMs
by: Jia, Yuhang, et al.
Published: (2025)
SAR-LM: Symbolic Audio Reasoning with Large Language Models
by: Taheri, Termeh, et al.
Published: (2025)
Audio-CoT: Exploring Chain-of-Thought Reasoning in Large Audio Language Model
by: Ma, Ziyang, et al.
Published: (2025)
ChildVox: A Speech, Audio, and Large Audio-Language Model Benchmark in Understanding and Characterizing Sound across Childhood
by: Feng, Tiantian, et al.
Published: (2026)
MeanAudio: Fast and Faithful Text-to-Audio Generation with Mean Flows
by: Li, Xiquan, et al.
Published: (2025)
SLAM-LLM: A Modular, Open-Source Multimodal Large Language Model Framework and Best Practice for Speech, Language, Audio and Music Processing
by: Ma, Ziyang, et al.
Published: (2026)
Can Audio Large Language Models Verify Speaker Identity?
by: Ren, Yiming, et al.
Published: (2025)
DiffATR: Diffusion-based Generative Modeling for Audio-Text Retrieval
by: Xin, Yifei, et al.
Published: (2024)
Focus Then Listen: Exploring Plug-and-Play Audio Enhancer for Noise-Robust Large Audio Language Models
by: Yin, Han, et al.
Published: (2026)
Words at Play: Benchmarking Audio Pun Understanding in Large Audio-Language Models
by: Su, Yuchen, et al.
Published: (2026)
Audio-Maestro: Enhancing Large Audio-Language Models with Tool-Augmented Reasoning
by: Lee, Kuan-Yi, et al.
Published: (2025)
AeroGPT: Leveraging Large-Scale Audio Model for Aero-Engine Bearing Fault Diagnosis
by: Liu, Jiale, et al.
Published: (2025)