| Main Authors: | Ai, Jiabao; Zhao, Minghui; Ragni, Anton |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | Audio and Speech Processing |
| Online Access: | https://arxiv.org/abs/2603.14032 |
| _version_ | 1866915863114809344 |
|---|---|
| author | Ai, Jiabao; Zhao, Minghui; Ragni, Anton |
| author_facet | Ai, Jiabao; Zhao, Minghui; Ragni, Anton |
| contents | Diffusion and flow matching TTS faces a tension between discrete temporal structure and continuous spectral modeling. Two-stage models diffuse on fixed alignments, often collapsing to mean prosody; single-stage models avoid explicit durations but suffer alignment instability. We propose a jump-diffusion framework where discrete jumps model temporal structure and continuous diffusion refines spectral content within one process. Even in its one-shot degenerate form, our framework achieves 3.37% WER vs. 4.38% for Grad-TTS with improved UTMOSv2 on LJSpeech. The full iterative UDD variant further enables adaptive prosody, autonomously inserting natural pauses in out-of-distribution slow speech rather than stretching uniformly. Audio samples are available at https://anonymousinterpseech.github.io/TTS_Demo/. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2603_14032 |
| institution | arXiv |
| publishDate | 2026 |
| record_format | arxiv |
| title | Beyond Two-stage Diffusion TTS: Joint Structure and Content Refinement via Jump Diffusion |
| topic | Audio and Speech Processing |
| url | https://arxiv.org/abs/2603.14032 |