Bibliographic Details
Main Authors: Chou, Hsing-Hang; Lin, Yun-Shao; Sung, Ching-Chin; Tsao, Yu; Lee, Chi-Chun
Format: Preprint
Published: 2024
Online Access: https://arxiv.org/abs/2409.03636
Table of Contents:
  • The human voice conveys not just words but also emotional states and individuality. Emotional voice conversion (EVC) modifies emotional expressions while preserving linguistic content and speaker identity, improving applications like human-machine interaction. While deep learning has advanced EVC models for specific target speakers on well-crafted emotional datasets, existing methods often face issues with emotion accuracy and speech distortion. In addition, the zero-shot scenario, in which emotion conversion is applied to unseen speakers, remains underexplored. This work introduces a novel diffusion framework with disentangled mechanisms and expressive guidance, trained on a large emotional speech dataset and evaluated on unseen speakers across in-domain and out-of-domain datasets. Experimental results show that our method produces expressive speech with high emotional accuracy, naturalness, and quality, showcasing its potential for broader EVC applications.