Bibliographic Details
Main Authors: Leang, Sotheara, Augusma, Anderson, Castelli, Eric, Letué, Frédérique, Sam, Sethserey, Vaufreydaz, Dominique
Format: Preprint
Published: 2024
Subjects:
Online Access: https://arxiv.org/abs/2409.15882
_version_ 1866912042806411264
author Leang, Sotheara
Augusma, Anderson
Castelli, Eric
Letué, Frédérique
Sam, Sethserey
Vaufreydaz, Dominique
contents Human speech conveys prosody, linguistic content, and speaker identity. This article investigates a novel speaker anonymization approach using an end-to-end network based on a Vector-Quantized Variational Auto-Encoder (VQ-VAE) to deal with these speech components. This approach is designed to disentangle these components to specifically target and modify the speaker identity while preserving the linguistic and emotional content. To do so, three separate branches compute embeddings for content, prosody, and speaker identity, respectively. During synthesis, taking these embeddings, the decoder of the proposed architecture is conditioned on both speaker and prosody information, allowing for capturing more nuanced emotional states and precise adjustments to speaker identification. Findings indicate that this method outperforms most baseline techniques in preserving emotional information. However, it exhibits more limited performance on other voice privacy tasks, emphasizing the need for further improvements.
format Preprint
id arxiv_https___arxiv_org_abs_2409_15882
institution arXiv
publishDate 2024
record_format arxiv
title Exploring VQ-VAE with Prosody Parameters for Speaker Anonymization
topic Computer Vision and Pattern Recognition
Signal Processing
url https://arxiv.org/abs/2409.15882