Bibliographic Details
Main Authors:	Ai, Jiabao; Zhao, Minghui; Ragni, Anton
Format:	Preprint
Published:	2026
Subjects:	Audio and Speech Processing
Online Access:	https://arxiv.org/abs/2603.14032

Abstract:

Diffusion and flow-matching text-to-speech (TTS) models face a tension between discrete temporal structure and continuous spectral modeling. Two-stage models diffuse on fixed alignments and often collapse to mean prosody; single-stage models avoid explicit durations but suffer from alignment instability. We propose a jump-diffusion framework in which discrete jumps model temporal structure and continuous diffusion refines spectral content within a single process. Even in its one-shot degenerate form, our framework achieves a word error rate (WER) of 3.37% vs. 4.38% for Grad-TTS, with improved UTMOSv2, on LJSpeech. The full iterative UDD variant further enables adaptive prosody, autonomously inserting natural pauses in out-of-distribution slow speech rather than stretching it uniformly. Audio samples are available at https://anonymousinterpseech.github.io/TTS_Demo/.
