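| Main Authors: | Du, Zongcai; Deng, Guilin; Guo, Xiaofeng; Gao, Xin; Li, Linke; Cheng, Kaichang; Han, Fubo; Yang, Siyu; Liu, Peng; Zhong, Pan; Fu, Qiang |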
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | Sound; Artificial Intelligence; Audio and Speech Processing |
| Online Access: | https://arxiv.org/abs/2510.09016 |
| _version_ | 1866911336721547264 |
|---|---|
| author | Du, Zongcai; Deng, Guilin; Guo, Xiaofeng; Gao, Xin; Li, Linke; Cheng, Kaichang; Han, Fubo; Yang, Siyu; Liu, Peng; Zhong, Pan; Fu, Qiang |
| author_facet | Du, Zongcai; Deng, Guilin; Guo, Xiaofeng; Gao, Xin; Li, Linke; Cheng, Kaichang; Han, Fubo; Yang, Siyu; Liu, Peng; Zhong, Pan; Fu, Qiang |
| contents | Recent progress in diffusion-based Singing Voice Synthesis (SVS) demonstrates strong expressiveness but remains limited by data scarcity and model scalability. We introduce a two-stage pipeline: a compact seed set of human-sung recordings is constructed by pairing fixed melodies with diverse LLM-generated lyrics, and melody-specific models are trained to synthesize over 500 hours of high-quality Chinese singing data. Building on this corpus, we propose DiTSinger, a Diffusion Transformer with RoPE and qk-norm, systematically scaled in depth, width, and resolution for enhanced fidelity. Furthermore, we design an implicit alignment mechanism that obviates phoneme-level duration labels by constraining phoneme-to-acoustic attention within character-level spans, thereby improving robustness under noisy or uncertain alignments. Extensive experiments validate that our approach enables scalable, alignment-free, and high-fidelity SVS. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2510_09016 |
| institution | arXiv |
| publishDate | 2025 |
| record_format | arxiv |
| title | DiTSinger: Scaling Singing Voice Synthesis with Diffusion Transformer and Implicit Alignment |
| topic | Sound; Artificial Intelligence; Audio and Speech Processing |
| url | https://arxiv.org/abs/2510.09016 |