Bibliographic Details
Main Authors: Du, Zongcai, Deng, Guilin, Guo, Xiaofeng, Gao, Xin, Li, Linke, Cheng, Kaichang, Han, Fubo, Yang, Siyu, Liu, Peng, Zhong, Pan, Fu, Qiang
Format: Preprint
Published: 2025
Subjects: Sound, Artificial Intelligence, Audio and Speech Processing
Online Access: https://arxiv.org/abs/2510.09016
_version_ 1866911336721547264
author Du, Zongcai
Deng, Guilin
Guo, Xiaofeng
Gao, Xin
Li, Linke
Cheng, Kaichang
Han, Fubo
Yang, Siyu
Liu, Peng
Zhong, Pan
Fu, Qiang
contents Recent progress in diffusion-based Singing Voice Synthesis (SVS) demonstrates strong expressiveness but remains limited by data scarcity and model scalability. We introduce a two-stage pipeline: a compact seed set of human-sung recordings is constructed by pairing fixed melodies with diverse LLM-generated lyrics, and melody-specific models are trained to synthesize over 500 hours of high-quality Chinese singing data. Building on this corpus, we propose DiTSinger, a Diffusion Transformer with RoPE and qk-norm, systematically scaled in depth, width, and resolution for enhanced fidelity. Furthermore, we design an implicit alignment mechanism that obviates phoneme-level duration labels by constraining phoneme-to-acoustic attention within character-level spans, thereby improving robustness under noisy or uncertain alignments. Extensive experiments validate that our approach enables scalable, alignment-free, and high-fidelity SVS.
format Preprint
id arxiv_https___arxiv_org_abs_2510_09016
institution arXiv
publishDate 2025
record_format arxiv
title DiTSinger: Scaling Singing Voice Synthesis with Diffusion Transformer and Implicit Alignment
topic Sound
Artificial Intelligence
Audio and Speech Processing
url https://arxiv.org/abs/2510.09016
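
The abstract in this record describes an implicit alignment mechanism that constrains phoneme-to-acoustic attention within character-level spans. The record gives no implementation details, so the following is only a minimal sketch of one way such a constraint could look: each acoustic frame may attend only to the phonemes of the character whose span contains that frame. The function names (`span_mask`, `masked_softmax`), the frame-as-query direction, the half-open span convention, and the toy data are all assumptions made for illustration, not the authors' method.

```python
import math

def span_mask(phoneme_to_char, char_spans, num_frames):
    """Return mask[t][p] = True when frame t lies inside the span of the
    character that owns phoneme p. Spans are half-open: [start, end)."""
    mask = []
    for t in range(num_frames):
        row = []
        for c in phoneme_to_char:
            start, end = char_spans[c]
            row.append(start <= t < end)
        mask.append(row)
    return mask

def masked_softmax(scores, mask):
    """Softmax over phonemes for each frame, with disallowed positions
    forced to zero weight."""
    weights = []
    for score_row, mask_row in zip(scores, mask):
        allowed = [s for s, m in zip(score_row, mask_row) if m]
        if not allowed:
            # Assumption: character spans cover every frame.
            raise ValueError("frame has no phoneme inside its character span")
        peak = max(allowed)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) if m else 0.0
                for s, m in zip(score_row, mask_row)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Hypothetical toy input: two characters with two phonemes each
# (for example an initial and a final), covering five frames in total.
phoneme_to_char = [0, 0, 1, 1]
char_spans = [(0, 3), (3, 5)]
mask = span_mask(phoneme_to_char, char_spans, num_frames=5)

# Uniform raw scores stand in for query-key dot products.
scores = [[0.0] * len(phoneme_to_char) for _ in range(5)]
weights = masked_softmax(scores, mask)
```

In this sketch only character-level spans are supplied. How a frame's attention is divided among the phonemes of its character is left to the attention weights, which is the sense in which phoneme-level duration labels are not required.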