| Main Authors: | He, Xiang, Li, Chenxing, Wang, Jinting, Rong, Yan, Xie, Tianxin, Wang, Wenfu, Liu, Li, Yu, Dong |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Online Access: | https://arxiv.org/abs/2604.18187 |
Similar Items
Audio-Thinker: Guiding Audio Language Model When and How to Think via Reinforcement Learning
by: Wu, Shu, et al.
Published: (2025)
AudioGenie-Reasoner: A Training-Free Multi-Agent Framework for Coarse-to-Fine Audio Deep Reasoning
by: Rong, Yan, et al.
Published: (2025)
SemanticVocoder: Bridging Audio Generation and Audio Understanding via Semantic Latents
by: Xie, Zeyu, et al.
Published: (2026)
PhyAVBench: A Challenging Audio Physics-Sensitivity Benchmark for Physically Grounded Text-to-Audio-Video Generation
by: Xie, Tianxin, et al.
Published: (2025)
SRC-gAudio: Sampling-Rate-Controlled Audio Generation
by: Li, Chenxing, et al.
Published: (2024)
Thinking with Sound: Audio Chain-of-Thought Enables Multimodal Reasoning in Large Audio-Language Models
by: Xiong, Zhen, et al.
Published: (2025)
GACA-DiT: Diffusion-based Dance-to-Music Generation with Genre-Adaptive Rhythm and Context-Aware Alignment
by: Wang, Jinting, et al.
Published: (2025)
Audio-CoT: Exploring Chain-of-Thought Reasoning in Large Audio Language Model
by: Ma, Ziyang, et al.
Published: (2025)
VCB Bench: An Evaluation Benchmark for Audio-Grounded Large Language Model Conversational Agents
by: Hu, Jiliang, et al.
Published: (2025)
When Audio Generators Become Good Listeners: Generative Features for Understanding Tasks
by: Xie, Zeyu, et al.
Published: (2025)
DegDiT: Controllable Audio Generation with Dynamic Event Graph Guided Diffusion Transformer
by: Liu, Yisu, et al.
Published: (2025)
AudioRAG+: Feedback-driven Retrieval-augmented Audio Generation with Large Audio Language Models
by: Zhao, Junqi, et al.
Published: (2025)
Boosting ASR Robustness via Test-Time Reinforcement Learning with Audio-Text Semantic Rewards
by: Fang, Linghan, et al.
Published: (2026)
AudioGenie: A Training-Free Multi-Agent Framework for Diverse Multimodality-to-Multiaudio Generation
by: Rong, Yan, et al.
Published: (2025)
EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion Transformer
by: Hai, Jiarui, et al.
Published: (2024)
EmotionThinker: Prosody-Aware Reinforcement Learning for Explainable Speech Emotion Reasoning
by: Wang, Dingdong, et al.
Published: (2026)
Audio Language Model for Deepfake Detection Grounded in Acoustic Chain-of-Thought
by: Chen, Runkun, et al.
Published: (2026)
Interpretable Audio Editing Evaluation via Chain-of-Thought Difference-Commonality Reasoning with Multimodal LLMs
by: Jia, Yuhang, et al.
Published: (2025)
Video-to-Audio Generation with Fine-grained Temporal Semantics
by: Hu, Yuchen, et al.
Published: (2024)
Covo-Audio Technical Report
by: Wang, Wenfu, et al.
Published: (2026)
The Interspeech 2026 Audio Reasoning Challenge: Evaluating Reasoning Process Quality for Audio Reasoning Models and Agents
by: Ma, Ziyang, et al.
Published: (2026)
EmoSteer-TTS: Fine-Grained and Training-Free Emotion-Controllable Text-to-Speech via Activation Steering
by: Xie, Tianxin, et al.
Published: (2025)
UniCUE: Unified Recognition and Generation Framework for Chinese Cued Speech Video-to-Speech Generation
by: Wang, Jinting, et al.
Published: (2025)
Separate First, Fuse Later: Mitigating Cross-Modal Interference in Audio-Visual LLMs Reasoning with Modality-Specific Chain-of-Thought
by: Li, Xuanchen, et al.
Published: (2026)
Jointly Recognizing Speech and Singing Voices Based on Multi-Task Audio Source Separation
by: Bai, Ye, et al.
Published: (2024)
PrismAudio: Decomposed Chain-of-Thoughts and Multi-dimensional Rewards for Video-to-Audio Generation
by: Liu, Huadai, et al.
Published: (2025)
Audio Flamingo Sound-CoT Technical Report: Improving Chain-of-Thought Reasoning in Sound Understanding
by: Kong, Zhifeng, et al.
Published: (2025)
AudSemThinker: Enhancing Audio-Language Models through Reasoning over Semantics of Sound
by: Wijngaard, Gijs, et al.
Published: (2025)
ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing
by: Liu, Huadai, et al.
Published: (2025)
AuTAgent: A Reinforcement Learning Framework for Tool-Augmented Audio Reasoning
by: Tong, Siqian, et al.
Published: (2026)
StyleBreak: Revealing Alignment Vulnerabilities in Large Audio-Language Models via Style-Aware Audio Jailbreak
by: Li, Hongyi, et al.
Published: (2025)
AudioRAG: A Challenging Benchmark for Audio Reasoning and Information Retrieval
by: Lin, Jingru, et al.
Published: (2026)
Resonate: Reinforcing Text-to-Audio Generation via Online Feedback from Large Audio Language Models
by: Li, Xiquan, et al.
Published: (2026)
RPRA-ADD: Forgery Trace Enhancement-Driven Audio Deepfake Detection
by: Fu, Ruibo, et al.
Published: (2025)
Prompt-guided Precise Audio Editing with Diffusion Models
by: Xu, Manjie, et al.
Published: (2024)
Audio-Guided Dynamic Modality Fusion with Stereo-Aware Attention for Audio-Visual Navigation
by: Li, Jia, et al.
Published: (2025)
STA-V2A: Video-to-Audio Generation with Semantic and Temporal Alignment
by: Ren, Yong, et al.
Published: (2024)
CosyAudio: Improving Audio Generation with Confidence Scores and Synthetic Captions
by: Zhu, Xinfa, et al.
Published: (2025)
Hearing from Silence: Reasoning Audio Descriptions from Silent Videos via Vision-Language Model
by: Ren, Yong, et al.
Published: (2025)
Interpretable All-Type Audio Deepfake Detection with Audio LLMs via Frequency-Time Reinforcement Learning
by: Xie, Yuankun, et al.
Published: (2026)