Bibliographic Details
Main Authors: Ai, Jiabao, Zhao, Minghui, Ragni, Anton
Format: Preprint
Published: 2026
Subjects: Audio and Speech Processing
Online Access: https://arxiv.org/abs/2603.14032
author Ai, Jiabao
Zhao, Minghui
Ragni, Anton
contents Diffusion and flow-matching TTS faces a tension between discrete temporal structure and continuous spectral modeling. Two-stage models diffuse on fixed alignments, often collapsing to mean prosody; single-stage models avoid explicit durations but suffer from alignment instability. We propose a jump-diffusion framework in which discrete jumps model temporal structure and continuous diffusion refines spectral content within one process. Even in its one-shot degenerate form, our framework achieves 3.37% WER vs. 4.38% for Grad-TTS, with improved UTMOSv2, on LJSpeech. The full iterative UDD variant further enables adaptive prosody, autonomously inserting natural pauses in out-of-distribution slow speech rather than stretching uniformly. Audio samples are available at https://anonymousinterpseech.github.io/TTS_Demo/.
format Preprint
id arxiv_https___arxiv_org_abs_2603_14032
institution arXiv
publishDate 2026
record_format arxiv
title Beyond Two-stage Diffusion TTS: Joint Structure and Content Refinement via Jump Diffusion
topic Audio and Speech Processing
url https://arxiv.org/abs/2603.14032