Bibliographic Details
Main Authors: Lin, Weiwei; He, Chenghan
Format: Preprint
Published: 2025
Subjects: Machine Learning; Sound; Audio and Speech Processing
Online Access: https://arxiv.org/abs/2502.01084
_version_ 1866909490320769024
author Lin, Weiwei
He, Chenghan
author_facet Lin, Weiwei
He, Chenghan
contents We propose a novel autoregressive modeling approach for speech synthesis, combining a variational autoencoder (VAE) with a multi-modal latent space and an autoregressive model that uses Gaussian Mixture Models (GMM) as the conditional probability distribution. Unlike previous methods that rely on residual vector quantization, our model leverages continuous speech representations from the VAE's latent space, greatly simplifying the training and inference pipelines. We also introduce a stochastic monotonic alignment mechanism to enforce strict monotonic alignments. Our approach significantly outperforms the state-of-the-art autoregressive model VALL-E in both subjective and objective evaluations, achieving these results with only 10.3% of VALL-E's parameters. This demonstrates the potential of continuous speech language models as a more efficient alternative to existing quantization-based speech language models. Sample audio can be found at https://tinyurl.com/gmm-lm-tts.
format Preprint
id arxiv_https___arxiv_org_abs_2502_01084
institution arXiv
publishDate 2025
record_format arxiv
spellingShingle Continuous Autoregressive Modeling with Stochastic Monotonic Alignment for Speech Synthesis
Lin, Weiwei
He, Chenghan
Machine Learning
Sound
Audio and Speech Processing
title Continuous Autoregressive Modeling with Stochastic Monotonic Alignment for Speech Synthesis
topic Machine Learning
Sound
Audio and Speech Processing
url https://arxiv.org/abs/2502.01084