| Main Authors: | Zhang, Linhao, Song, Yuhan, Liu, Aiwei, Wu, Chuhan, Zhang, Sijun, Jia, Wei, Liu, Yuan, Wang, Houfeng, Zhou, Xiao |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2604.12506 |
Similar Items
StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs
by: Song, Yuhan, et al.
Published: (2025)
Beyond Classification: Towards Speech Emotion Reasoning with Multitask AudioLLMs
by: Zhang, Wenyu, et al.
Published: (2025)
MoWE-Audio: Multitask AudioLLMs with Mixture of Weak Encoders
by: Zhang, Wenyu, et al.
Published: (2024)
AR&D: A Framework for Retrieving and Describing Concepts for Interpreting AudioLLMs
by: Chowdhury, Townim Faisal, et al.
Published: (2026)
UniFlow-Audio: Unified Flow Matching for Audio Generation from Omni-Modalities
by: Xu, Xuenan, et al.
Published: (2025)
Beyond Transcripts: A Renewed Perspective on Audio Chaptering
by: Retkowski, Fabian, et al.
Published: (2026)
Zero-Shot Cognitive Impairment Detection from Speech Using AudioLLM
by: Shahin, Mostafa, et al.
Published: (2025)
UniTok-Audio: A Unified Audio Generation Framework via Generative Modeling on Discrete Codec Tokens
by: Liu, Chengwei, et al.
Published: (2025)
Towards Explicit Acoustic Evidence Perception in Audio LLMs for Speech Deepfake Detection
by: Guo, Xiaoxuan, et al.
Published: (2026)
UniAudio 2.0: A Unified Audio Language Model with Text-Aligned Factorized Audio Tokenization
by: Yang, Dongchao, et al.
Published: (2026)
AudioMotionBench: Evaluating Auditory Motion Perception in Audio LLMs
by: Sun, Zhe, et al.
Published: (2025)
Multi-Task Instruction Tuning via Data Scheduling for Low-Resource Arabic AudioLLMs
by: Bhatti, Hunzalah Hassan, et al.
Published: (2026)
MMEDIT: A Unified Framework for Multi-Type Audio Editing via Audio Language Model
by: Tao, Ye, et al.
Published: (2025)
Dasheng AudioGen: A Unified Model for Generating Coherent Audio Scenes from Text
by: Mei, Jiahao, et al.
Published: (2026)
TART: A Comprehensive Tool for Technique-Aware Audio-to-Tab Guitar Transcription
by: Gupta, Akshaj, et al.
Published: (2025)
MENASpeechBank: A Reference Voice Bank with Persona-Conditioned Multi-Turn Conversations for AudioLLMs
by: Ali, Zien Sheikh, et al.
Published: (2026)
Audio-VLA: Adding Contact Audio Perception to Vision-Language-Action Model for Robotic Manipulation
by: Wei, Xiangyi, et al.
Published: (2025)
Temporal Contrastive Decoding: A Training-Free Method for Large Audio-Language Models
by: Li, Yanda, et al.
Published: (2026)
From Vision to Audio and Beyond: A Unified Model for Audio-Visual Representation and Generation
by: Su, Kun, et al.
Published: (2024)
UltraEval-Audio: A Unified Framework for Comprehensive Evaluation of Audio Foundation Models
by: Shi, Qundong, et al.
Published: (2026)
WeDLM: Reconciling Diffusion Language Models with Standard Causal Attention for Fast Inference
by: Liu, Aiwei, et al.
Published: (2025)
WildSpeech-Bench: Benchmarking End-to-End SpeechLLMs in the Wild
by: Zhang, Linhao, et al.
Published: (2025)
AudioX: A Unified Framework for Anything-to-Audio Generation
by: Tian, Zeyue, et al.
Published: (2025)
From Contrast to Commonality: Audio Commonality Captioning for Enhanced Audio-Text Cross-modal Understanding in Multimodal LLMs
by: Jia, Yuhang, et al.
Published: (2025)
Emotion and Acoustics Should Agree: Cross-Level Inconsistency Analysis for Audio Deepfake Detection
by: Zhang, Jinhua, et al.
Published: (2026)
AudioChat: Unified Audio Storytelling, Editing, and Understanding with Transfusion Forcing
by: Chen, William, et al.
Published: (2026)
ChronosAudio: A Comprehensive Long-Audio Benchmark for Evaluating Audio-Large Language Models
by: Luo, Kaiwen, et al.
Published: (2026)
Echo: Towards Advanced Audio Comprehension via Audio-Interleaved Reasoning
by: Wu, Daiqing, et al.
Published: (2026)
VocalParse: Towards Unified and Scalable Singing Voice Transcription with Large Audio Language Models
by: Chen, Yukun, et al.
Published: (2026)
Beyond Video-to-SFX: Video to Audio Synthesis with Environmentally Aware Speech
by: Niu, Xinlei, et al.
Published: (2025)
MiDashengLM: Efficient Audio Understanding with General Audio Captions
by: Dinkel, Heinrich, et al.
Published: (2025)
Towards Fine-grained Temporal Perception: Post-Training Large Audio-Language Models with Audio-Side Time Prompt
by: Shi, Yanfeng, et al.
Published: (2026)
StyleBreak: Revealing Alignment Vulnerabilities in Large Audio-Language Models via Style-Aware Audio Jailbreak
by: Li, Hongyi, et al.
Published: (2025)
RUMAA: Repeat-Aware Unified Music Audio Analysis for Score-Performance Alignment, Transcription, and Mistake Detection
by: Chang, Sungkyun, et al.
Published: (2025)
Audio-Guided Dynamic Modality Fusion with Stereo-Aware Attention for Audio-Visual Navigation
by: Li, Jia, et al.
Published: (2025)
AudioKV: KV Cache Eviction in Efficient Large Audio Language Models
by: Wang, Yuxuan, et al.
Published: (2026)
A Sensitivity Analysis of Multi-Event Audio Grounding in Audio LLMs
by: Lee, Taehan, et al.
Published: (2026)
FlashAudio: Rectified Flows for Fast and High-Fidelity Text-to-Audio Generation
by: Liu, Huadai, et al.
Published: (2024)
When Scaling Fails: Mitigating Audio Perception Decay of LALMs via Multi-Step Perception-Aware Reasoning
by: Mao, Ruixiang, et al.
Published: (2026)
Interpretable All-Type Audio Deepfake Detection with Audio LLMs via Frequency-Time Reinforcement Learning
by: Xie, Yuankun, et al.
Published: (2026)