| Main Authors: | Iliescu, Dan Andrei; Mohan, Devang Savita Ram; Teh, Tian Huey; Hodari, Zack |
|---|---|
| Format: | Preprint |
| Published: | 2023 |
| Subjects: | Audio and Speech Processing; Artificial Intelligence; Computation and Language; Machine Learning |
| Online Access: | https://arxiv.org/abs/2303.09446 |
| _version_ | 1866910411441307648 |
|---|---|
| author | Iliescu, Dan Andrei; Mohan, Devang Savita Ram; Teh, Tian Huey; Hodari, Zack |
| author_facet | Iliescu, Dan Andrei; Mohan, Devang Savita Ram; Teh, Tian Huey; Hodari, Zack |
| contents | We address the problem of human-in-the-loop control for generating prosody in the context of text-to-speech synthesis. Controlling prosody is challenging because existing generative models lack an efficient interface through which users can modify the output quickly and precisely. To solve this, we introduce a novel framework whereby the user provides partial inputs and the generative model generates the missing features. We propose a model that is specifically designed to encode partial prosodic features and output complete audio. We show empirically that our model displays two essential qualities of a human-in-the-loop control mechanism: efficiency and robustness. With even a very small number of input values (~4), our model enables users to improve the quality of the output significantly in terms of listener preference (4:1). |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2303_09446 |
| institution | arXiv |
| publishDate | 2023 |
| record_format | arxiv |
| title | Controllable Prosody Generation With Partial Inputs |
| topic | Audio and Speech Processing; Artificial Intelligence; Computation and Language; Machine Learning |
| url | https://arxiv.org/abs/2303.09446 |