| Main Authors: | Yin, Chun; Chi, Tai-Shih; Tsao, Yu; Wang, Hsin-Min |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | Audio and Speech Processing; Machine Learning; Sound |
| Online Access: | https://arxiv.org/abs/2406.08445 |
| _version_ | 1866916283870609408 |
|---|---|
| author | Yin, Chun; Chi, Tai-Shih; Tsao, Yu; Wang, Hsin-Min |
| author_facet | Yin, Chun; Chi, Tai-Shih; Tsao, Yu; Wang, Hsin-Min |
| contents | Representations from pre-trained speech foundation models (SFMs) have shown impressive performance in many downstream tasks. However, the potential benefits of incorporating pre-trained SFM representations into speaker voice similarity assessment have not been thoroughly investigated. In this paper, we propose SVSNet+, a model that integrates pre-trained SFM representations to improve performance in assessing speaker voice similarity. Experimental results on the Voice Conversion Challenge 2018 and 2020 datasets show that SVSNet+ incorporating WavLM representations achieves significant improvements over baseline models. In addition, while fine-tuning WavLM with a small dataset of the downstream task does not improve performance, using the same dataset to learn a weighted-sum representation of WavLM can substantially improve performance. Furthermore, when WavLM is replaced by other SFMs, SVSNet+ still outperforms the baseline models and exhibits strong generalization ability. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2406_08445 |
| institution | arXiv |
| publishDate | 2024 |
| record_format | arxiv |
| title | SVSNet+: Enhancing Speaker Voice Similarity Assessment Models with Representations from Speech Foundation Models |
| topic | Audio and Speech Processing; Machine Learning; Sound |
| url | https://arxiv.org/abs/2406.08445 |
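
The abstract in this record mentions learning a weighted-sum representation of WavLM. The record does not say how that sum is formed, so the following is only a minimal sketch of one common way to combine per-layer features: one learnable scalar per layer, normalized with a softmax. The function names `softmax` and `weighted_sum_layers`, the softmax normalization, and the nested-list layout `[layer][frame][dim]` are all assumptions made for illustration, not the authors' implementation.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Turn unconstrained per-layer scalars into weights that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_sum_layers(
    layers: Sequence[Sequence[Sequence[float]]],
    logits: Sequence[float],
) -> List[List[float]]:
    """Combine per-layer features into a single sequence of feature vectors.

    layers: hypothetical layout [num_layers][num_frames][dim], e.g. the
            hidden states of each layer of a speech foundation model.
    logits: one scalar per layer; in training these would be learned
            (assumption: softmax-normalized, which the record does not confirm).
    """
    if len(layers) != len(logits):
        raise ValueError("need exactly one logit per layer")
    weights = softmax(logits)
    num_frames = len(layers[0])
    dim = len(layers[0][0])
    combined = [[0.0] * dim for _ in range(num_frames)]
    for weight, layer in zip(weights, layers):
        for t in range(num_frames):
            for d in range(dim):
                combined[t][d] += weight * layer[t][d]
    return combined


# Toy input: two layers, one frame, two feature dimensions.
toy_layers = [
    [[1.0, 2.0]],
    [[3.0, 4.0]],
]
toy_output = weighted_sum_layers(toy_layers, logits=[0.0, 0.0])
```

With equal logits the weights are uniform, so the toy output is the plain average of the two layers. In a trained model the logits would move away from uniform to emphasize the layers that are most useful for the similarity task.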