Staff View: Continuous Autoregressive Modeling with Stochastic Monotonic Alignment for Speech Synthesis :: Library Catalog

Bibliographic Details
Main Authors:	Lin, Weiwei; He, Chenghan
Format:	Preprint
Published:	2025
Subjects:	Machine Learning; Sound; Audio and Speech Processing
Online Access:	https://arxiv.org/abs/2502.01084

_version_	1866909490320769024
author	Lin, Weiwei; He, Chenghan
author_facet	Lin, Weiwei; He, Chenghan
contents	We propose a novel autoregressive modeling approach for speech synthesis, combining a variational autoencoder (VAE) with a multi-modal latent space and an autoregressive model that uses Gaussian Mixture Models (GMM) as the conditional probability distribution. Unlike previous methods that rely on residual vector quantization, our model leverages continuous speech representations from the VAE's latent space, greatly simplifying the training and inference pipelines. We also introduce a stochastic monotonic alignment mechanism to enforce strict monotonic alignments. Our approach significantly outperforms the state-of-the-art autoregressive model VALL-E in both subjective and objective evaluations, achieving these results with only 10.3% of VALL-E's parameters. This demonstrates the potential of continuous speech language models as a more efficient alternative to existing quantization-based speech language models. Sample audio can be found at https://tinyurl.com/gmm-lm-tts.
format	Preprint
id	arxiv_https___arxiv_org_abs_2502_01084
institution	arXiv
publishDate	2025
record_format	arxiv
title	Continuous Autoregressive Modeling with Stochastic Monotonic Alignment for Speech Synthesis
topic	Machine Learning; Sound; Audio and Speech Processing
url	https://arxiv.org/abs/2502.01084