Bibliographic Details
Main Authors: Iliescu, Dan Andrei; Mohan, Devang Savita Ram; Teh, Tian Huey; Hodari, Zack
Format: Preprint
Published: 2023
Subjects:
Online Access: https://arxiv.org/abs/2303.09446
author Iliescu, Dan Andrei
Mohan, Devang Savita Ram
Teh, Tian Huey
Hodari, Zack
contents We address the problem of human-in-the-loop control for generating prosody in the context of text-to-speech synthesis. Controlling prosody is challenging because existing generative models lack an efficient interface through which users can modify the output quickly and precisely. To solve this, we introduce a novel framework whereby the user provides partial inputs and the generative model generates the missing features. We propose a model that is specifically designed to encode partial prosodic features and output complete audio. We show empirically that our model displays two essential qualities of a human-in-the-loop control mechanism: efficiency and robustness. With even a very small number of input values (~4), our model enables users to improve the quality of the output significantly in terms of listener preference (4:1).
format Preprint
id arxiv_https___arxiv_org_abs_2303_09446
institution arXiv
publishDate 2023
record_format arxiv
title Controllable Prosody Generation With Partial Inputs
topic Audio and Speech Processing
Artificial Intelligence
Computation and Language
Machine Learning
url https://arxiv.org/abs/2303.09446