Staff View: Controllable Prosody Generation With Partial Inputs :: Library Catalog

Bibliographic Details
Main Authors:	Iliescu, Dan Andrei; Mohan, Devang Savita Ram; Teh, Tian Huey; Hodari, Zack
Format:	Preprint
Published:	2023
Subjects:	Audio and Speech Processing; Artificial Intelligence; Computation and Language; Machine Learning
Online Access:	https://arxiv.org/abs/2303.09446

_version_	1866910411441307648
author	Iliescu, Dan Andrei; Mohan, Devang Savita Ram; Teh, Tian Huey; Hodari, Zack
author_facet	Iliescu, Dan Andrei; Mohan, Devang Savita Ram; Teh, Tian Huey; Hodari, Zack
contents	We address the problem of human-in-the-loop control for generating prosody in the context of text-to-speech synthesis. Controlling prosody is challenging because existing generative models lack an efficient interface through which users can modify the output quickly and precisely. To solve this, we introduce a novel framework whereby the user provides partial inputs and the generative model generates the missing features. We propose a model that is specifically designed to encode partial prosodic features and output complete audio. We show empirically that our model displays two essential qualities of a human-in-the-loop control mechanism: efficiency and robustness. With even a very small number of input values (~4), our model enables users to improve the quality of the output significantly in terms of listener preference (4:1).
format	Preprint
id	arxiv_https___arxiv_org_abs_2303_09446
institution	arXiv
publishDate	2023
record_format	arxiv
title	Controllable Prosody Generation With Partial Inputs
topic	Audio and Speech Processing; Artificial Intelligence; Computation and Language; Machine Learning
url	https://arxiv.org/abs/2303.09446

Similar Items