Staff View: Exploring VQ-VAE with Prosody Parameters for Speaker Anonymization :: Library Catalog

Bibliographic Details
Main Authors:	Leang, Sotheara; Augusma, Anderson; Castelli, Eric; Letué, Frédérique; Sam, Sethserey; Vaufreydaz, Dominique
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition; Signal Processing
Online Access:	https://arxiv.org/abs/2409.15882

_version_	1866912042806411264
author	Leang, Sotheara; Augusma, Anderson; Castelli, Eric; Letué, Frédérique; Sam, Sethserey; Vaufreydaz, Dominique
author_facet	Leang, Sotheara; Augusma, Anderson; Castelli, Eric; Letué, Frédérique; Sam, Sethserey; Vaufreydaz, Dominique
contents	Human speech conveys prosody, linguistic content, and speaker identity. This article investigates a novel speaker anonymization approach using an end-to-end network based on a Vector-Quantized Variational Auto-Encoder (VQ-VAE) to deal with these speech components. This approach is designed to disentangle these components to specifically target and modify the speaker identity while preserving the linguistic and emotional content. To do so, three separate branches compute embeddings for content, prosody, and speaker identity respectively. During synthesis, taking these embeddings, the decoder of the proposed architecture is conditioned on both speaker and prosody information, allowing for capturing more nuanced emotional states and precise adjustments to speaker identification. Findings indicate that this method outperforms most baseline techniques in preserving emotional information. However, it exhibits more limited performance on other voice privacy tasks, emphasizing the need for further improvements.
format	Preprint
id	arxiv_https___arxiv_org_abs_2409_15882
institution	arXiv
publishDate	2024
record_format	arxiv
title	Exploring VQ-VAE with Prosody Parameters for Speaker Anonymization
topic	Computer Vision and Pattern Recognition; Signal Processing
url	https://arxiv.org/abs/2409.15882

Similar Items