| Main Authors: | Liu, Yisu, Li, Chenxing, Zhang, Wanqian, Wang, Wenfu, Yu, Meng, Fu, Ruibo, Lin, Zheng, Wang, Weiping, Yu, Dong |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Online Access: | https://arxiv.org/abs/2508.13786 |
Similar Items
Audio-Thinker: Guiding Audio Language Model When and How to Think via Reinforcement Learning
by: Wu, Shu, et al.
Published: (2025)
When Audio Generators Become Good Listeners: Generative Features for Understanding Tasks
by: Xie, Zeyu, et al.
Published: (2025)
VCB Bench: An Evaluation Benchmark for Audio-Grounded Large Language Model Conversational Agents
by: Hu, Jiliang, et al.
Published: (2025)
Audio-DeepThinker: Progressive Reasoning-Aware Reinforcement Learning for High-Quality Chain-of-Thought Emergence in Audio Language Models
by: He, Xiang, et al.
Published: (2026)
EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion Transformer
by: Hai, Jiarui, et al.
Published: (2024)
SRC-gAudio: Sampling-Rate-Controlled Audio Generation
by: Li, Chenxing, et al.
Published: (2024)
Mitigating Audiovisual Mismatch in Visual-Guide Audio Captioning
by: Xu, Le, et al.
Published: (2025)
SemanticVocoder: Bridging Audio Generation and Audio Understanding via Semantic Latents
by: Xie, Zeyu, et al.
Published: (2026)
AudioRAG+: Feedback-driven Retrieval-augmented Audio Generation with Large Audio Language Models
by: Zhao, Junqi, et al.
Published: (2025)
U-Codec: Ultra Low Frame-rate Neural Speech Codec for Fast High-fidelity Speech Generation
by: Yang, Xusheng, et al.
Published: (2025)
AudioGenie-Reasoner: A Training-Free Multi-Agent Framework for Coarse-to-Fine Audio Deep Reasoning
by: Rong, Yan, et al.
Published: (2025)
Video-to-Audio Generation with Fine-grained Temporal Semantics
by: Hu, Yuchen, et al.
Published: (2024)
Audio-Guided Dynamic Modality Fusion with Stereo-Aware Attention for Audio-Visual Navigation
by: Li, Jia, et al.
Published: (2025)
Hearing from Silence: Reasoning Audio Descriptions from Silent Videos via Vision-Language Model
by: Ren, Yong, et al.
Published: (2025)
Prompt-guided Precise Audio Editing with Diffusion Models
by: Xu, Manjie, et al.
Published: (2024)
GACA-DiT: Diffusion-based Dance-to-Music Generation with Genre-Adaptive Rhythm and Context-Aware Alignment
by: Wang, Jinting, et al.
Published: (2025)
Text Prompt is Not Enough: Sound Event Enhanced Prompt Adapter for Target Style Audio Generation
by: Xiong, Chenxu, et al.
Published: (2024)
M3-TTS: Multi-modal DiT Alignment & Mel-latent for Zero-shot High-fidelity Speech Synthesis
by: Wang, Xiaopeng, et al.
Published: (2025)
RPRA-ADD: Forgery Trace Enhancement-Driven Audio Deepfake Detection
by: Fu, Ruibo, et al.
Published: (2025)
Learning Control of Neural Sound Effects Synthesis from Physically Inspired Models
by: Zong, Yisu, et al.
Published: (2025)
ProAV-DiT: A Projected Latent Diffusion Transformer for Efficient Synchronized Audio-Video Generation
by: Sun, Jiahui, et al.
Published: (2025)
PhyAVBench: A Challenging Audio Physics-Sensitivity Benchmark for Physically Grounded Text-to-Audio-Video Generation
by: Xie, Tianxin, et al.
Published: (2025)
STA-V2A: Video-to-Audio Generation with Semantic and Temporal Alignment
by: Ren, Yong, et al.
Published: (2024)
Audio-Guided Visual Perception for Audio-Visual Navigation
by: Wang, Yi, et al.
Published: (2025)
P2Mark: Plug-and-play Parameter-level Watermarking for Neural Speech Generation
by: Ren, Yong, et al.
Published: (2025)
Mixture of Experts Fusion for Fake Audio Detection Using Frozen wav2vec 2.0
by: Wang, Zhiyong, et al.
Published: (2024)
Covo-Audio Technical Report
by: Wang, Wenfu, et al.
Published: (2026)
Mel-Refine: A Plug-and-Play Approach to Refine Mel-Spectrogram in Audio Generation
by: Guo, Hongming, et al.
Published: (2024)
Audio Palette: A Diffusion Transformer with Multi-Signal Conditioning for Controllable Foley Synthesis
by: Wang, Junnuo
Published: (2025)
LongCat-AudioDiT: High-Fidelity Diffusion Text-to-Speech in the Waveform Latent Space
by: Xin, Detai, et al.
Published: (2026)
Disrupting Diffusion: Token-Level Attention Erasure Attack against Diffusion-based Customization
by: Liu, Yisu, et al.
Published: (2024)
Video-to-Audio Generation with Hidden Alignment
by: Xu, Manjie, et al.
Published: (2024)
Interpretable All-Type Audio Deepfake Detection with Audio LLMs via Frequency-Time Reinforcement Learning
by: Xie, Yuankun, et al.
Published: (2026)
Jointly Recognizing Speech and Singing Voices Based on Multi-Task Audio Source Separation
by: Bai, Ye, et al.
Published: (2024)
JavisDiT: Joint Audio-Video Diffusion Transformer with Hierarchical Spatio-Temporal Prior Synchronization
by: Liu, Kai, et al.
Published: (2025)
EmoSteer-TTS: Fine-Grained and Training-Free Emotion-Controllable Text-to-Speech via Activation Steering
by: Xie, Tianxin, et al.
Published: (2025)
Detect All-Type Deepfake Audio: Wavelet Prompt Tuning for Enhanced Auditory Perception
by: Xie, Yuankun, et al.
Published: (2025)
AudioMoG: Guiding Audio Generation with Mixture-of-Guidance
by: Wang, Junyou, et al.
Published: (2025)
Complex Image-Generative Diffusion Transformer for Audio Denoising
by: Li, Junhui, et al.
Published: (2024)
SoloAudio: Target Sound Extraction with Language-oriented Audio Diffusion Transformer
by: Wang, Helin, et al.
Published: (2024)