Staff View: Discrete-Time Diffusion-Like Models for Speech Synthesis :: Library Catalog

Bibliographic Details
Main Authors:	Tan, Xiaozhou; Zhao, Minghui; Ragni, Anton
Format:	Preprint
Published:	2025
Subjects:	Machine Learning; Audio and Speech Processing
Online Access:	https://arxiv.org/abs/2509.18470

_version_	1866912643923574784
author	Tan, Xiaozhou; Zhao, Minghui; Ragni, Anton
author_facet	Tan, Xiaozhou; Zhao, Minghui; Ragni, Anton
contents	Diffusion models have attracted a lot of attention in recent years. These models view speech generation as a continuous-time process. For efficient training, this process is typically restricted to additive Gaussian noising, which is limiting. For inference, the time is typically discretized, leading to the mismatch between continuous training and discrete sampling conditions. Recently proposed discrete-time processes, on the other hand, usually do not have these limitations, may require substantially fewer inference steps, and are fully consistent between training/inference conditions. This paper explores some diffusion-like discrete-time processes and proposes some new variants. These include processes applying additive Gaussian noise, multiplicative Gaussian noise, blurring noise and a mixture of blurring and Gaussian noises. The experimental results suggest that discrete-time processes offer comparable subjective and objective speech quality to their widely popular continuous counterpart, with more efficient and consistent training and inference schemas.
format	Preprint
id	arxiv_https___arxiv_org_abs_2509_18470
institution	arXiv
publishDate	2025
record_format	arxiv
title	Discrete-Time Diffusion-Like Models for Speech Synthesis
topic	Machine Learning; Audio and Speech Processing
url	https://arxiv.org/abs/2509.18470

Similar Items