| Main Authors: | Mei, Jiahao, Xu, Xuenan, Xie, Zeyu, Zheng, Zihao, Tao, Ye, Ding, Yue, Wu, Mengyue |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2510.05875 |
Similar Items
PicoAudio: Enabling Precise Timestamp and Frequency Controllability of Audio Events in Text-to-audio Generation
by: Xie, Zeyu, et al.
Published: (2024)
PicoAudio2: Temporal Controllable Text-to-Audio Generation with Natural Language Description
by: Zheng, Zihao, et al.
Published: (2025)
UniFlow-Audio: Unified Flow Matching for Audio Generation from Omni-Modalities
by: Xu, Xuenan, et al.
Published: (2025)
STAR: Speech-to-Audio Generation via Representation Learning
by: Xie, Zeyu, et al.
Published: (2025)
Enhance Temporal Relations in Audio Captioning with Sound Event Detection
by: Xie, Zeyu, et al.
Published: (2023)
Enhancing Audio Generation Diversity with Visual Information
by: Xie, Zeyu, et al.
Published: (2024)
AudioTime: A Temporally-aligned Audio-text Benchmark Dataset
by: Xie, Zeyu, et al.
Published: (2024)
Smooth-Foley: Creating Continuous Sound for Video-to-Audio Generation Under Semantic Guidance
by: Zhang, Yaoyun, et al.
Published: (2024)
FakeSound: Deepfake General Audio Detection
by: Xie, Zeyu, et al.
Published: (2024)
CAST-TTS: A Simple Cross-Attention Framework for Unified Timbre Control in TTS
by: Zheng, Zihao, et al.
Published: (2026)
SemanticVocoder: Bridging Audio Generation and Audio Understanding via Semantic Latents
by: Xie, Zeyu, et al.
Published: (2026)
Auto-ACD: A Large-scale Dataset for Audio-Language Representation Learning
by: Sun, Luoyi, et al.
Published: (2023)
A Detailed Audio-Text Data Simulation Pipeline using Single-Event Sounds
by: Xu, Xuenan, et al.
Published: (2024)
MMEDIT: A Unified Framework for Multi-Type Audio Editing via Audio Language Model
by: Tao, Ye, et al.
Published: (2025)
Dasheng AudioGen: A Unified Model for Generating Coherent Audio Scenes from Text
by: Mei, Jiahao, et al.
Published: (2026)
DiveSound: LLM-Assisted Automatic Taxonomy Construction for Diverse Audio Generation
by: Li, Baihan, et al.
Published: (2024)
When Audio Generators Become Good Listeners: Generative Features for Understanding Tasks
by: Xie, Zeyu, et al.
Published: (2025)
BLAT: Bootstrapping Language-Audio Pre-training based on AudioSet Tag-guided Synthetic Data
by: Xu, Xuenan, et al.
Published: (2023)
Towards Weakly Supervised Text-to-Audio Grounding
by: Xu, Xuenan, et al.
Published: (2024)
FakeSound2: A Benchmark for Explainable and Generalizable Deepfake Sound Detection
by: Xie, Zeyu, et al.
Published: (2025)
Unified Pathological Speech Analysis with Prompt Tuning
by: Yang, Fei, et al.
Published: (2024)
Enhancing Zero-shot Audio Classification using Sound Attribute Knowledge from Large Language Models
by: Xu, Xuenan, et al.
Published: (2024)
Efficient Audio Captioning with Encoder-Level Knowledge Distillation
by: Xu, Xuenan, et al.
Published: (2024)
SyMuPe: Affective and Controllable Symbolic Music Performance
by: Borovik, Ilya, et al.
Published: (2025)
DRCap: Decoding CLAP Latents with Retrieval-Augmented Generation for Zero-shot Audio Captioning
by: Li, Xiquan, et al.
Published: (2024)
Exploring Acoustic Similarity in Emotional Speech and Music via Self-Supervised Representations
by: Sun, Yujia, et al.
Published: (2024)
PhraseVAE and PhraseLDM: Latent Diffusion for Full-Song Multitrack Symbolic Music Generation
by: Ou, Longshen, et al.
Published: (2025)
NarraScore: Bridging Visual Narrative and Musical Dynamics via Hierarchical Affective Control
by: Wen, Yufan, et al.
Published: (2026)
SemantiCodec: An Ultra Low Bitrate Semantic Audio Codec for General Sound
by: Liu, Haohe, et al.
Published: (2024)
LoopGen: Training-Free Loopable Music Generation
by: Marincione, Davide, et al.
Published: (2025)
EMelodyGen: Emotion-Conditioned Melody Generation in ABC Notation with the Musical Feature Template
by: Zhou, Monan, et al.
Published: (2023)
SAMUeL: Efficient Vocal-Conditioned Music Generation via Soft Alignment Attention and Latent Diffusion
by: Cheung, Hei Shing, et al.
Published: (2025)
YingMusic-Singer: Zero-shot Singing Voice Synthesis and Editing with Annotation-free Melody Guidance
by: Zheng, Junjie, et al.
Published: (2025)
Evaluating Disentangled Representations for Controllable Music Generation
by: Ibáñez-Martínez, Laura, et al.
Published: (2026)
Video Echoed in Music: Semantic, Temporal, and Rhythmic Alignment for Video-to-Music Generation
by: Tong, Xinyi, et al.
Published: (2025)
Affectron: Emotional Speech Synthesis with Affective and Contextually Aligned Nonverbal Vocalizations
by: Cho, Deok-Hyeon, et al.
Published: (2026)
Persian MusicGen: A Large-Scale Dataset and Culturally-Aware Generative Model for Persian Music
by: Sameti, Mohammad Hossein, et al.
Published: (2026)
Video2Music: Suitable Music Generation from Videos using an Affective Multimodal Transformer model
by: Kang, Jaeyong, et al.
Published: (2023)
Versatile Symbolic Music-for-Music Modeling via Function Alignment
by: Jiang, Junyan, et al.
Published: (2025)
MusiConGen: Rhythm and Chord Control for Transformer-Based Text-to-Music Generation
by: Lan, Yun-Han, et al.
Published: (2024)