Staff View: :: Library Catalog

Saved in:

Bibliographic Details
Main Authors:	Lyth, Dan, King, Simon
Format:	Preprint
Published:	2024
Subjects:	Sound Computation and Language Audio and Speech Processing
Online Access:	https://arxiv.org/abs/2402.01912
Tags:	Add Tag No Tags, Be the first to tag this record!

_version_	1866910321585684480
author	Lyth, Dan King, Simon
author_facet	Lyth, Dan King, Simon
contents	Text-to-speech models trained on large-scale datasets have demonstrated impressive in-context learning capabilities and naturalness. However, control of speaker identity and style in these models typically requires conditioning on reference speech recordings, limiting creative applications. Alternatively, natural language prompting of speaker identity and style has demonstrated promising results and provides an intuitive method of control. However, reliance on human-labeled descriptions prevents scaling to large datasets. Our work bridges the gap between these two approaches. We propose a scalable method for labeling various aspects of speaker identity, style, and recording conditions. We then apply this method to a 45k hour dataset, which we use to train a speech language model. Furthermore, we propose simple methods for increasing audio fidelity, significantly outperforming recent work despite relying entirely on found data. Our results demonstrate high-fidelity speech generation in a diverse range of accents, prosodic styles, channel conditions, and acoustic conditions, all accomplished with a single model and intuitive natural language conditioning. Audio samples can be heard at https://text-description-to-speech.com/.
format	Preprint
id	arxiv_https___arxiv_org_abs_2402_01912
institution	arXiv
publishDate	2024
record_format	arxiv
spellingShingle	Natural language guidance of high-fidelity text-to-speech with synthetic annotations Lyth, Dan King, Simon Sound Computation and Language Audio and Speech Processing Text-to-speech models trained on large-scale datasets have demonstrated impressive in-context learning capabilities and naturalness. However, control of speaker identity and style in these models typically requires conditioning on reference speech recordings, limiting creative applications. Alternatively, natural language prompting of speaker identity and style has demonstrated promising results and provides an intuitive method of control. However, reliance on human-labeled descriptions prevents scaling to large datasets. Our work bridges the gap between these two approaches. We propose a scalable method for labeling various aspects of speaker identity, style, and recording conditions. We then apply this method to a 45k hour dataset, which we use to train a speech language model. Furthermore, we propose simple methods for increasing audio fidelity, significantly outperforming recent work despite relying entirely on found data. Our results demonstrate high-fidelity speech generation in a diverse range of accents, prosodic styles, channel conditions, and acoustic conditions, all accomplished with a single model and intuitive natural language conditioning. Audio samples can be heard at https://text-description-to-speech.com/.
title	Natural language guidance of high-fidelity text-to-speech with synthetic annotations
topic	Sound Computation and Language Audio and Speech Processing
url	https://arxiv.org/abs/2402.01912

Similar Items