Staff View: Beyond Two-stage Diffusion TTS: Joint Structure and Content Refinement via Jump Diffusion :: Library Catalog

Bibliographic Details
Main Authors:	Ai, Jiabao; Zhao, Minghui; Ragni, Anton
Format:	Preprint
Published:	2026
Subjects:	Audio and Speech Processing
Online Access:	https://arxiv.org/abs/2603.14032

_version_	1866915863114809344
author	Ai, Jiabao; Zhao, Minghui; Ragni, Anton
author_facet	Ai, Jiabao; Zhao, Minghui; Ragni, Anton
contents	Diffusion and flow matching TTS faces a tension between discrete temporal structure and continuous spectral modeling. Two-stage models diffuse on fixed alignments, often collapsing to mean prosody; single-stage models avoid explicit durations but suffer alignment instability. We propose a jump-diffusion framework where discrete jumps model temporal structure and continuous diffusion refines spectral content within one process. Even in its one-shot degenerate form, our framework achieves 3.37% WER vs. 4.38% for Grad-TTS with improved UTMOSv2 on LJSpeech. The full iterative UDD variant further enables adaptive prosody, autonomously inserting natural pauses in out-of-distribution slow speech rather than stretching uniformly. Audio samples are available at https://anonymousinterpseech.github.io/TTS_Demo/.
format	Preprint
id	arxiv_https___arxiv_org_abs_2603_14032
institution	arXiv
publishDate	2026
record_format	arxiv
title	Beyond Two-stage Diffusion TTS: Joint Structure and Content Refinement via Jump Diffusion
topic	Audio and Speech Processing
url	https://arxiv.org/abs/2603.14032

Similar Items