Bibliographic Details
Main Authors:	Du, Zongcai; Deng, Guilin; Guo, Xiaofeng; Gao, Xin; Li, Linke; Cheng, Kaichang; Han, Fubo; Yang, Siyu; Liu, Peng; Zhong, Pan; Fu, Qiang
Format:	Preprint
Published:	2025
Subjects:	Sound; Artificial Intelligence; Audio and Speech Processing
Online Access:	https://arxiv.org/abs/2510.09016

Summary:

Recent progress in diffusion-based Singing Voice Synthesis (SVS) demonstrates strong expressiveness but remains limited by data scarcity and model scalability. We introduce a two-stage pipeline: a compact seed set of human-sung recordings is constructed by pairing fixed melodies with diverse LLM-generated lyrics, and melody-specific models are trained to synthesize over 500 hours of high-quality Chinese singing data. Building on this corpus, we propose DiTSinger, a Diffusion Transformer with RoPE and qk-norm, systematically scaled in depth, width, and resolution for enhanced fidelity. Furthermore, we design an implicit alignment mechanism that obviates phoneme-level duration labels by constraining phoneme-to-acoustic attention within character-level spans, thereby improving robustness under noisy or uncertain alignments. Extensive experiments validate that our approach enables scalable, alignment-free, and high-fidelity SVS.
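The implicit alignment idea in the summary, restricting phoneme-to-acoustic attention to character-level spans, can be illustrated with a small attention mask. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names `build_span_mask` and `masked_softmax`, the half-open frame intervals, and the toy inputs are all hypothetical, and the record does not say how DiTSinger builds or applies its mask.

```python
import math


def build_span_mask(char_spans, phonemes_per_char, num_frames):
    """Return mask[f][p], True when acoustic frame f may attend to phoneme p.

    Assumption: each character has a known frame span [start, end), while
    the timing of the phonemes inside that character is left unlabeled.
    """
    # Map every phoneme index to the character it belongs to.
    phoneme_to_char = [
        c for c, count in enumerate(phonemes_per_char) for _ in range(count)
    ]
    mask = [[False] * len(phoneme_to_char) for _ in range(num_frames)]
    for p, c in enumerate(phoneme_to_char):
        start, end = char_spans[c]
        for f in range(start, min(end, num_frames)):
            mask[f][p] = True
    return mask


def masked_softmax(scores, allowed):
    """Softmax over one row of scores; disallowed positions get weight 0."""
    kept = [s for s, ok in zip(scores, allowed) if ok]
    if not kept:
        return [0.0] * len(scores)
    peak = max(kept)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) if ok else 0.0 for s, ok in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]


# Toy example: two characters with 2 and 3 phonemes, covering 5 frames.
char_spans = [(0, 3), (3, 5)]
phonemes_per_char = [2, 3]
mask = build_span_mask(char_spans, phonemes_per_char, num_frames=5)

# Frame 0 lies in the first character's span, so only its two phonemes
# can receive attention weight.
weights = masked_softmax([0.0, 0.0, 0.0, 0.0, 0.0], mask[0])
print(weights)
```

Within a span the model remains free to distribute attention among that character's phonemes, which is why no phoneme-level duration labels are needed in this kind of scheme; only the coarser character boundaries are fixed.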
