Saved in:
| Main Authors: | Tao, Ye, Wu, Wen, Zhang, Chao, Wu, Mengyue, Wang, Shuai, Xu, Xuenan |
|---|---|
| Format: | Preprint |
| Published: |
2025
|
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2512.20339 |
| Tags: |
Add Tag
No Tags, Be the first to tag this record!
|
Similar Items
UniFlow-Audio: Unified Flow Matching for Audio Generation from Omni-Modalities
by: Xu, Xuenan, et al.
Published: (2025)
by: Xu, Xuenan, et al.
Published: (2025)
PicoAudio2: Temporal Controllable Text-to-Audio Generation with Natural Language Description
by: Zheng, Zihao, et al.
Published: (2025)
by: Zheng, Zihao, et al.
Published: (2025)
AudioTime: A Temporally-aligned Audio-text Benchmark Dataset
by: Xie, Zeyu, et al.
Published: (2024)
by: Xie, Zeyu, et al.
Published: (2024)
Smooth-Foley: Creating Continuous Sound for Video-to-Audio Generation Under Semantic Guidance
by: Zhang, Yaoyun, et al.
Published: (2024)
by: Zhang, Yaoyun, et al.
Published: (2024)
PicoAudio: Enabling Precise Timestamp and Frequency Controllability of Audio Events in Text-to-audio Generation
by: Xie, Zeyu, et al.
Published: (2024)
by: Xie, Zeyu, et al.
Published: (2024)
Towards Weakly Supervised Text-to-Audio Grounding
by: Xu, Xuenan, et al.
Published: (2024)
by: Xu, Xuenan, et al.
Published: (2024)
Enhancing Zero-shot Audio Classification using Sound Attribute Knowledge from Large Language Models
by: Xu, Xuenan, et al.
Published: (2024)
by: Xu, Xuenan, et al.
Published: (2024)
Enhance Temporal Relations in Audio Captioning with Sound Event Detection
by: Xie, Zeyu, et al.
Published: (2023)
by: Xie, Zeyu, et al.
Published: (2023)
CAST-TTS: A Simple Cross-Attention Framework for Unified Timbre Control in TTS
by: Zheng, Zihao, et al.
Published: (2026)
by: Zheng, Zihao, et al.
Published: (2026)
Auto-ACD: A Large-scale Dataset for Audio-Language Representation Learning
by: Sun, Luoyi, et al.
Published: (2023)
by: Sun, Luoyi, et al.
Published: (2023)
Can Audio Large Language Models Verify Speaker Identity?
by: Ren, Yiming, et al.
Published: (2025)
by: Ren, Yiming, et al.
Published: (2025)
BLAT: Bootstrapping Language-Audio Pre-training based on AudioSet Tag-guided Synthetic Data
by: Xu, Xuenan, et al.
Published: (2023)
by: Xu, Xuenan, et al.
Published: (2023)
STAR: Speech-to-Audio Generation via Representation Learning
by: Xie, Zeyu, et al.
Published: (2025)
by: Xie, Zeyu, et al.
Published: (2025)
AuDirector: A Self-Reflective Closed-Loop Framework for Immersive Audio Storytelling
by: Ren, Yiming, et al.
Published: (2026)
by: Ren, Yiming, et al.
Published: (2026)
Efficient Audio Captioning with Encoder-Level Knowledge Distillation
by: Xu, Xuenan, et al.
Published: (2024)
by: Xu, Xuenan, et al.
Published: (2024)
Enhancing Audio Generation Diversity with Visual Information
by: Xie, Zeyu, et al.
Published: (2024)
by: Xie, Zeyu, et al.
Published: (2024)
A Detailed Audio-Text Data Simulation Pipeline using Single-Event Sounds
by: Xu, Xuenan, et al.
Published: (2024)
by: Xu, Xuenan, et al.
Published: (2024)
HoliAntiSpoof: Audio LLM for Holistic Speech Anti-Spoofing
by: Xu, Xuenan, et al.
Published: (2026)
by: Xu, Xuenan, et al.
Published: (2026)
SemanticVocoder: Bridging Audio Generation and Audio Understanding via Semantic Latents
by: Xie, Zeyu, et al.
Published: (2026)
by: Xie, Zeyu, et al.
Published: (2026)
FakeSound: Deepfake General Audio Detection
by: Xie, Zeyu, et al.
Published: (2024)
by: Xie, Zeyu, et al.
Published: (2024)
Unified Pathological Speech Analysis with Prompt Tuning
by: Yang, Fei, et al.
Published: (2024)
by: Yang, Fei, et al.
Published: (2024)
DiveSound: LLM-Assisted Automatic Taxonomy Construction for Diverse Audio Generation
by: Li, Baihan, et al.
Published: (2024)
by: Li, Baihan, et al.
Published: (2024)
Dasheng AudioGen: A Unified Model for Generating Coherent Audio Scenes from Text
by: Mei, Jiahao, et al.
Published: (2026)
by: Mei, Jiahao, et al.
Published: (2026)
SemantiCodec: An Ultra Low Bitrate Semantic Audio Codec for General Sound
by: Liu, Haohe, et al.
Published: (2024)
by: Liu, Haohe, et al.
Published: (2024)
LARA-Gen: Enabling Continuous Emotion Control for Music Generation Models via Latent Affective Representation Alignment
by: Mei, Jiahao, et al.
Published: (2025)
by: Mei, Jiahao, et al.
Published: (2025)
When Audio Generators Become Good Listeners: Generative Features for Understanding Tasks
by: Xie, Zeyu, et al.
Published: (2025)
by: Xie, Zeyu, et al.
Published: (2025)
UniAudio 2.0: A Unified Audio Language Model with Text-Aligned Factorized Audio Tokenization
by: Yang, Dongchao, et al.
Published: (2026)
by: Yang, Dongchao, et al.
Published: (2026)
ChronosAudio: A Comprehensive Long-Audio Benchmark for Evaluating Audio-Large Language Models
by: Luo, Kaiwen, et al.
Published: (2026)
by: Luo, Kaiwen, et al.
Published: (2026)
Unified Audio Event Detection
by: Jiang, Yidi, et al.
Published: (2024)
by: Jiang, Yidi, et al.
Published: (2024)
AudioChat: Unified Audio Storytelling, Editing, and Understanding with Transfusion Forcing
by: Chen, William, et al.
Published: (2026)
by: Chen, William, et al.
Published: (2026)
StyleBreak: Revealing Alignment Vulnerabilities in Large Audio-Language Models via Style-Aware Audio Jailbreak
by: Li, Hongyi, et al.
Published: (2025)
by: Li, Hongyi, et al.
Published: (2025)
SemanticAudio: Audio Generation and Editing in Semantic Space
by: Dai, Zheqi, et al.
Published: (2026)
by: Dai, Zheqi, et al.
Published: (2026)
FineLAP: Taming Heterogeneous Supervision for Fine-grained Language-Audio Pretraining
by: Li, Xiquan, et al.
Published: (2026)
by: Li, Xiquan, et al.
Published: (2026)
Guiding Audio Editing with Audio Language Model
by: Lan, Zitong, et al.
Published: (2025)
by: Lan, Zitong, et al.
Published: (2025)
UniTok-Audio: A Unified Audio Generation Framework via Generative Modeling on Discrete Codec Tokens
by: Liu, Chengwei, et al.
Published: (2025)
by: Liu, Chengwei, et al.
Published: (2025)
Interpretable All-Type Audio Deepfake Detection with Audio LLMs via Frequency-Time Reinforcement Learning
by: Xie, Yuankun, et al.
Published: (2026)
by: Xie, Yuankun, et al.
Published: (2026)
Beyond Transcription: Unified Audio Schema for Perception-Aware AudioLLMs
by: Zhang, Linhao, et al.
Published: (2026)
by: Zhang, Linhao, et al.
Published: (2026)
AudioToolAgent: An Agentic Framework for Audio-Language Models
by: Wijngaard, Gijs, et al.
Published: (2025)
by: Wijngaard, Gijs, et al.
Published: (2025)
SALM: Spatial Audio Language Model with Structured Embeddings for Understanding and Editing
by: Hu, Jinbo, et al.
Published: (2025)
by: Hu, Jinbo, et al.
Published: (2025)
ALMTokenizer: A Low-bitrate and Semantic-rich Audio Codec Tokenizer for Audio Language Modeling
by: Yang, Dongchao, et al.
Published: (2025)
by: Yang, Dongchao, et al.
Published: (2025)
Similar Items
-
UniFlow-Audio: Unified Flow Matching for Audio Generation from Omni-Modalities
by: Xu, Xuenan, et al.
Published: (2025) -
PicoAudio2: Temporal Controllable Text-to-Audio Generation with Natural Language Description
by: Zheng, Zihao, et al.
Published: (2025) -
AudioTime: A Temporally-aligned Audio-text Benchmark Dataset
by: Xie, Zeyu, et al.
Published: (2024) -
Smooth-Foley: Creating Continuous Sound for Video-to-Audio Generation Under Semantic Guidance
by: Zhang, Yaoyun, et al.
Published: (2024) -
PicoAudio: Enabling Precise Timestamp and Frequency Controllability of Audio Events in Text-to-audio Generation
by: Xie, Zeyu, et al.
Published: (2024)