MARC View :: Library Catalog


Bibliographic Details
Main Authors:	Zhang, Chong; Liu, Yanqing; Zheng, Yang; Zhao, Sheng
Format:	Preprint
Published:	2024
Subjects:	Audio and Speech Processing
Online Access:	https://arxiv.org/abs/2406.04633

_version_	1866909219148529664
author	Zhang, Chong; Liu, Yanqing; Zheng, Yang; Zhao, Sheng
author_facet	Zhang, Chong; Liu, Yanqing; Zheng, Yang; Zhao, Sheng
contents	Scaling text-to-speech (TTS) with autoregressive language models (LMs) to large-scale datasets by quantizing waveforms into discrete speech tokens has made great progress in capturing the diversity and expressiveness of human speech, but the quality of speech reconstructed from discrete speech tokens is far from satisfactory and depends on the token compression ratio. Generative diffusion models trained with a score-matching loss and continuous normalizing flows trained with a flow-matching loss have become prominent in the generation of images as well as speech. LM-based TTS systems usually quantize speech into discrete tokens, generate these tokens autoregressively, and finally use a diffusion model to up-sample the coarse-grained speech tokens into fine-grained codec features or mel-spectrograms before a vocoder reconstructs the waveform; this pipeline has high latency and is impractical for real-time speech applications. In this paper, we systematically investigate a variety of diffusion models for the up-sampling stage, which is the main bottleneck for streaming synthesis in LM- and diffusion-based architectures. We present the model architecture together with objective and subjective metrics that show improvements in quality and efficiency.
format	Preprint
id	arxiv_https___arxiv_org_abs_2406_04633
institution	arXiv
publishDate	2024
record_format	arxiv
title	Boosting Diffusion Model for Spectrogram Up-sampling in Text-to-speech: An Empirical Study
topic	Audio and Speech Processing
url	https://arxiv.org/abs/2406.04633
