Staff View: SAMOS: A Neural MOS Prediction Model Leveraging Semantic Representations and Acoustic Features :: Library Catalog

Bibliographic Details
Main Authors:	Shi, Yu-Fei; Ai, Yang; Lu, Ye-Xin; Du, Hui-Peng; Ling, Zhen-Hua
Format:	Preprint
Published:	2024
Subjects:	Sound; Audio and Speech Processing
Online Access:	https://arxiv.org/abs/2411.11232

_version_	1866912122944880640
author	Shi, Yu-Fei; Ai, Yang; Lu, Ye-Xin; Du, Hui-Peng; Ling, Zhen-Hua
author_facet	Shi, Yu-Fei; Ai, Yang; Lu, Ye-Xin; Du, Hui-Peng; Ling, Zhen-Hua
contents	Assessing the naturalness of speech using mean opinion score (MOS) prediction models has positive implications for the automatic evaluation of speech synthesis systems. Early MOS prediction models took the raw waveform or amplitude spectrum of speech as input, whereas more advanced methods employed self-supervised-learning (SSL) based models to extract semantic representations from speech for MOS prediction. These methods utilized only limited aspects of speech information for MOS prediction, which restricted their prediction accuracy. Therefore, in this paper, we propose SAMOS, a MOS prediction model that leverages both Semantic and Acoustic information of the speech to be assessed. Specifically, the proposed SAMOS leverages a pretrained wav2vec2 to extract semantic representations and uses the feature extractor of a pretrained BiVocoder to extract acoustic features. These two types of features are then fed into the prediction network, which includes multi-task heads and an aggregation layer, to obtain the final MOS score. Experimental results demonstrate that the proposed SAMOS outperforms current state-of-the-art MOS prediction models on the BVCC dataset and achieves comparable performance on the BC2019 dataset, according to the results of system-level evaluation metrics.
format	Preprint
id	arxiv_https___arxiv_org_abs_2411_11232
institution	arXiv
publishDate	2024
record_format	arxiv
title	SAMOS: A Neural MOS Prediction Model Leveraging Semantic Representations and Acoustic Features
topic	Sound; Audio and Speech Processing
url	https://arxiv.org/abs/2411.11232
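
Note on "system-level evaluation metrics" (contents field): the record does not say how these metrics are computed. A common convention in MOS prediction work, assumed here, is to average true and predicted scores over all utterances of each synthesis system and then compare the per-system averages by error and correlation. The sketch below illustrates that convention only; the function names, the tuple layout, and the toy data are hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from statistics import mean


def system_level_scores(utterances):
    """Average utterance-level scores per system.

    `utterances` is an iterable of (system_id, true_mos, predicted_mos)
    tuples (a hypothetical layout). Returns two parallel lists of
    per-system averages, ordered by system_id.
    """
    by_system = defaultdict(lambda: ([], []))
    for system_id, true_mos, pred_mos in utterances:
        by_system[system_id][0].append(true_mos)
        by_system[system_id][1].append(pred_mos)
    systems = sorted(by_system)
    true_avg = [mean(by_system[s][0]) for s in systems]
    pred_avg = [mean(by_system[s][1]) for s in systems]
    return true_avg, pred_avg


def mse(xs, ys):
    """Mean squared error between two equal-length score lists."""
    return mean((x - y) ** 2 for x, y in zip(xs, ys))


def pearson(xs, ys):
    """Linear (Pearson) correlation coefficient."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Toy data (invented for illustration): three systems, two utterances each.
toy = [
    ("sysA", 4.0, 3.8), ("sysA", 4.5, 4.0),
    ("sysB", 2.0, 2.5), ("sysB", 2.5, 2.5),
    ("sysC", 3.0, 3.4), ("sysC", 3.5, 3.0),
]
true_avg, pred_avg = system_level_scores(toy)
print("system-level MSE:", mse(true_avg, pred_avg))
print("system-level LCC:", pearson(true_avg, pred_avg))
print("system-level SRCC:", spearman(true_avg, pred_avg))
```

Averaging before correlating means that utterance-level noise within a system largely cancels out, which is why system-level figures are usually reported separately from utterance-level ones.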

Similar Items