MARC21 :: Library Catalog

Bibliographic Details
Main Authors:	Yin, Chun; Chi, Tai-Shih; Tsao, Yu; Wang, Hsin-Min
Format:	Preprint
Published:	2024
Subjects:	Audio and Speech Processing; Machine Learning; Sound
Online Access:	https://arxiv.org/abs/2406.08445

_version_	1866916283870609408
author	Yin, Chun; Chi, Tai-Shih; Tsao, Yu; Wang, Hsin-Min
author_facet	Yin, Chun; Chi, Tai-Shih; Tsao, Yu; Wang, Hsin-Min
contents	Representations from pre-trained speech foundation models (SFMs) have shown impressive performance in many downstream tasks. However, the potential benefits of incorporating pre-trained SFM representations into speaker voice similarity assessment have not been thoroughly investigated. In this paper, we propose SVSNet+, a model that integrates pre-trained SFM representations to improve performance in assessing speaker voice similarity. Experimental results on the Voice Conversion Challenge 2018 and 2020 datasets show that SVSNet+ incorporating WavLM representations shows significant improvements compared to baseline models. In addition, while fine-tuning WavLM with a small dataset of the downstream task does not improve performance, using the same dataset to learn a weighted-sum representation of WavLM can substantially improve performance. Furthermore, when WavLM is replaced by other SFMs, SVSNet+ still outperforms the baseline models and exhibits strong generalization ability.
format	Preprint
id	arxiv_https___arxiv_org_abs_2406_08445
institution	arXiv
publishDate	2024
record_format	arxiv
title	SVSNet+: Enhancing Speaker Voice Similarity Assessment Models with Representations from Speech Foundation Models
topic	Audio and Speech Processing; Machine Learning; Sound
url	https://arxiv.org/abs/2406.08445