Saved in:
| Main Authors: | Lee, Taehan, Jung, Jaehan, Lee, Hyukjun |
|---|---|
| Format: | Preprint |
| Published: |
2026
|
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2603.03855 |
| Tags: |
Add Tag
No Tags, Be the first to tag this record!
|
Similar Items
SAM: A Mamba-2 State-Space Audio-Language Model
by: Lee, Taehan, et al.
Published: (2025)
by: Lee, Taehan, et al.
Published: (2025)
Token Pruning in Audio Transformers: Optimizing Performance and Decoding Patch Importance
by: Lee, Taehan, et al.
Published: (2025)
by: Lee, Taehan, et al.
Published: (2025)
Can Large Audio Language Models Understand Audio Well? Speech, Scene and Events Understanding Benchmark for LALMs
by: Yin, Han, et al.
Published: (2025)
by: Yin, Han, et al.
Published: (2025)
EnCLAP: Combining Neural Audio Codec and Audio-Text Joint Embedding for Automated Audio Captioning
by: Kim, Jaeyeon, et al.
Published: (2024)
by: Kim, Jaeyeon, et al.
Published: (2024)
Omni-Embed-Audio: Leveraging Multimodal LLMs for Robust Audio-Text Retrieval
by: Yoo, HaeJun, et al.
Published: (2026)
by: Yoo, HaeJun, et al.
Published: (2026)
PhyAVBench: A Challenging Audio Physics-Sensitivity Benchmark for Physically Grounded Text-to-Audio-Video Generation
by: Xie, Tianxin, et al.
Published: (2025)
by: Xie, Tianxin, et al.
Published: (2025)
Audio-Maestro: Enhancing Large Audio-Language Models with Tool-Augmented Reasoning
by: Lee, Kuan-Yi, et al.
Published: (2025)
by: Lee, Kuan-Yi, et al.
Published: (2025)
DeFT-Mamba: Universal Multichannel Sound Separation and Polyphonic Audio Classification
by: Lee, Dongheon, et al.
Published: (2024)
by: Lee, Dongheon, et al.
Published: (2024)
Causal Tracing of Audio-Text Fusion in Large Audio Language Models
by: Chen, Wei-Chih, et al.
Published: (2026)
by: Chen, Wei-Chih, et al.
Published: (2026)
Unified Audio Event Detection
by: Jiang, Yidi, et al.
Published: (2024)
by: Jiang, Yidi, et al.
Published: (2024)
MUKA: Multi Kernel Audio Adaptation Of Audio-Language Models
by: Bensaid, Reda, et al.
Published: (2026)
by: Bensaid, Reda, et al.
Published: (2026)
Beyond Transcription: Unified Audio Schema for Perception-Aware AudioLLMs
by: Zhang, Linhao, et al.
Published: (2026)
by: Zhang, Linhao, et al.
Published: (2026)
AudioMotionBench: Evaluating Auditory Motion Perception in Audio LLMs
by: Sun, Zhe, et al.
Published: (2025)
by: Sun, Zhe, et al.
Published: (2025)
When Silence Matters: The Impact of Irrelevant Audio on Text Reasoning in Large Audio-Language Models
by: Li, Chen-An, et al.
Published: (2025)
by: Li, Chen-An, et al.
Published: (2025)
SAKURA: On the Multi-hop Reasoning of Large Audio-Language Models Based on Speech and Audio Information
by: Yang, Chih-Kai, et al.
Published: (2025)
by: Yang, Chih-Kai, et al.
Published: (2025)
Can Large Audio-Language Models Truly Hear? Tackling Hallucinations with Multi-Task Assessment and Stepwise Audio Reasoning
by: Kuan, Chun-Yi, et al.
Published: (2024)
by: Kuan, Chun-Yi, et al.
Published: (2024)
AudioBERT: Audio Knowledge Augmented Language Model
by: Ok, Hyunjong, et al.
Published: (2024)
by: Ok, Hyunjong, et al.
Published: (2024)
Focus Then Listen: Exploring Plug-and-Play Audio Enhancer for Noise-Robust Large Audio Language Models
by: Yin, Han, et al.
Published: (2026)
by: Yin, Han, et al.
Published: (2026)
FlowAVSE: Efficient Audio-Visual Speech Enhancement with Conditional Flow Matching
by: Jung, Chaeyoung, et al.
Published: (2024)
by: Jung, Chaeyoung, et al.
Published: (2024)
Improving Audio Event Recognition with Consistency Regularization
by: Sadhu, Shanmuka, et al.
Published: (2025)
by: Sadhu, Shanmuka, et al.
Published: (2025)
Deformable Audio Transformer for Audio Event Detection
by: Zhu, Wentao
Published: (2023)
by: Zhu, Wentao
Published: (2023)
MMEDIT: A Unified Framework for Multi-Type Audio Editing via Audio Language Model
by: Tao, Ye, et al.
Published: (2025)
by: Tao, Ye, et al.
Published: (2025)
AudioSet-R: A Refined AudioSet with Multi-Stage LLM Label Reannotation
by: Sun, Yulin, et al.
Published: (2025)
by: Sun, Yulin, et al.
Published: (2025)
AudioScene: Integrating Object-Event Audio into 3D Scenes
by: Yuan, Shuaihang, et al.
Published: (2025)
by: Yuan, Shuaihang, et al.
Published: (2025)
MMAudio-LABEL: Audio Event Labeling via Audio Generation for Silent Video
by: Tateishi, Kazuya, et al.
Published: (2026)
by: Tateishi, Kazuya, et al.
Published: (2026)
Codec-Robust Attacks on Audio LLMs
by: Roh, Jaechul, et al.
Published: (2026)
by: Roh, Jaechul, et al.
Published: (2026)
ChronosAudio: A Comprehensive Long-Audio Benchmark for Evaluating Audio-Large Language Models
by: Luo, Kaiwen, et al.
Published: (2026)
by: Luo, Kaiwen, et al.
Published: (2026)
A2SB: Audio-to-Audio Schrodinger Bridges
by: Kong, Zhifeng, et al.
Published: (2025)
by: Kong, Zhifeng, et al.
Published: (2025)
AudioLens: A Closer Look at Auditory Attribute Perception of Large Audio-Language Models
by: Yang, Chih-Kai, et al.
Published: (2025)
by: Yang, Chih-Kai, et al.
Published: (2025)
TW-Sound580K: A Regional Audio-Text Dataset with Verification-Guided Curation for Localized Audio-Language Modeling
by: Xie, Hao-Hui, et al.
Published: (2026)
by: Xie, Hao-Hui, et al.
Published: (2026)
Towards Weakly Supervised Text-to-Audio Grounding
by: Xu, Xuenan, et al.
Published: (2024)
by: Xu, Xuenan, et al.
Published: (2024)
Audio-Language Datasets of Scenes and Events: A Survey
by: Wijngaard, Gijs, et al.
Published: (2024)
by: Wijngaard, Gijs, et al.
Published: (2024)
Correlation of Fréchet Audio Distance With Human Perception of Environmental Audio Is Embedding Dependant
by: Tailleur, Modan, et al.
Published: (2024)
by: Tailleur, Modan, et al.
Published: (2024)
AudioSpa: Spatializing Sound Events with Text
by: Feng, Linfeng, et al.
Published: (2025)
by: Feng, Linfeng, et al.
Published: (2025)
PoolingVQ: A VQVAE Variant for Reducing Audio Redundancy and Boosting Multi-Modal Fusion in Music Emotion Analysis
by: Zou, Dinghao, et al.
Published: (2025)
by: Zou, Dinghao, et al.
Published: (2025)
Audio Language Model for Deepfake Detection Grounded in Acoustic Chain-of-Thought
by: Chen, Runkun, et al.
Published: (2026)
by: Chen, Runkun, et al.
Published: (2026)
AudioMarathon: A Comprehensive Benchmark for Long-Context Audio Understanding and Efficiency in Audio LLMs
by: He, Peize, et al.
Published: (2025)
by: He, Peize, et al.
Published: (2025)
LLM-Codec: Neural Audio Codec Meets Language Model Objectives
by: Chung, Ho-Lam, et al.
Published: (2026)
by: Chung, Ho-Lam, et al.
Published: (2026)
UniAudio 2.0: A Unified Audio Language Model with Text-Aligned Factorized Audio Tokenization
by: Yang, Dongchao, et al.
Published: (2026)
by: Yang, Dongchao, et al.
Published: (2026)
Expanding on EnCLAP with Auxiliary Retrieval Model for Automated Audio Captioning
by: Kim, Jaeyeon, et al.
Published: (2024)
by: Kim, Jaeyeon, et al.
Published: (2024)
Similar Items
-
SAM: A Mamba-2 State-Space Audio-Language Model
by: Lee, Taehan, et al.
Published: (2025) -
Token Pruning in Audio Transformers: Optimizing Performance and Decoding Patch Importance
by: Lee, Taehan, et al.
Published: (2025) -
Can Large Audio Language Models Understand Audio Well? Speech, Scene and Events Understanding Benchmark for LALMs
by: Yin, Han, et al.
Published: (2025) -
EnCLAP: Combining Neural Audio Codec and Audio-Text Joint Embedding for Automated Audio Captioning
by: Kim, Jaeyeon, et al.
Published: (2024) -
Omni-Embed-Audio: Leveraging Multimodal LLMs for Robust Audio-Text Retrieval
by: Yoo, HaeJun, et al.
Published: (2026)