Staff View: UrgentMOS: Unified Multi-Metric and Preference Learning for Robust Speech Quality Assessment

Bibliographic Details
Main Authors:	Wang, Wei; Zhang, Wangyou; Li, Chenda; Wang, Jiahe; Cornell, Samuele; Sach, Marvin; Saijo, Kohei; Fu, Yihui; Ni, Zhaoheng; Han, Bing; Gong, Xun; Bi, Mengxiao; Fingscheidt, Tim; Watanabe, Shinji; Qian, Yanmin
Format:	Preprint
Published:	2026
Subjects:	Sound
Online Access:	https://arxiv.org/abs/2601.18438

_version_	1866917223699841024
author	Wang, Wei; Zhang, Wangyou; Li, Chenda; Wang, Jiahe; Cornell, Samuele; Sach, Marvin; Saijo, Kohei; Fu, Yihui; Ni, Zhaoheng; Han, Bing; Gong, Xun; Bi, Mengxiao; Fingscheidt, Tim; Watanabe, Shinji; Qian, Yanmin
author_facet	Wang, Wei; Zhang, Wangyou; Li, Chenda; Wang, Jiahe; Cornell, Samuele; Sach, Marvin; Saijo, Kohei; Fu, Yihui; Ni, Zhaoheng; Han, Bing; Gong, Xun; Bi, Mengxiao; Fingscheidt, Tim; Watanabe, Shinji; Qian, Yanmin
contents	Automatic speech quality assessment has become increasingly important as modern speech generation systems continue to advance, while human listening tests remain costly, time-consuming, and difficult to scale. Most existing learning-based assessment models rely primarily on scarce human-annotated mean opinion score (MOS) data, which limits robustness and generalization, especially when training across heterogeneous datasets. In this work, we propose UrgentMOS, a unified speech quality assessment framework that jointly learns from diverse objective and perceptual quality metrics, while explicitly tolerating the absence of arbitrary subsets of metrics during training. By leveraging complementary quality facets under heterogeneous supervision, UrgentMOS enables effective utilization of partially annotated data and improves robustness when trained on large-scale, multi-source datasets. Beyond absolute score prediction, UrgentMOS explicitly models pairwise quality preferences by directly predicting comparative MOS (CMOS), making it well suited for preference-based evaluation scenarios commonly adopted in system benchmarking. Extensive experiments across a wide range of speech quality datasets, including simulated distortions, speech enhancement, and speech synthesis, demonstrate that UrgentMOS consistently achieves state-of-the-art performance in both absolute and comparative evaluation settings.
format	Preprint
id	arxiv_https___arxiv_org_abs_2601_18438
institution	arXiv
publishDate	2026
record_format	arxiv
title	UrgentMOS: Unified Multi-Metric and Preference Learning for Robust Speech Quality Assessment
topic	Sound
url	https://arxiv.org/abs/2601.18438
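
The contents field describes training on several quality metrics while tolerating samples where some metrics are missing, and predicting a comparative score (CMOS) for pairs. The record does not state the loss the authors use, so the following is only a minimal sketch under stated assumptions: a mean squared error averaged over whichever metrics are annotated, and a sign-based reading of a comparative score. The function names, the metric keys ("mos", "pesq", "stoi"), and the margin parameter are hypothetical and do not come from the paper.

```python
import math


def masked_multi_metric_loss(predictions, targets):
    """Mean squared error over the metrics that are annotated for one sample.

    predictions: dict mapping metric name -> predicted score
    targets:     dict mapping metric name -> reference score; a metric may be
                 absent, None, or NaN when it was not annotated
    Assumption: plain MSE stands in for whatever loss the paper actually uses.
    """
    total, count = 0.0, 0
    for name, pred in predictions.items():
        target = targets.get(name)
        if target is None or math.isnan(target):
            continue  # missing metric: it contributes nothing to the loss
        total += (pred - target) ** 2
        count += 1
    # With no annotated metric at all, return 0 so the sample is ignored.
    return total / count if count else 0.0


def preference_from_cmos(cmos, margin=0.0):
    """Read a comparative score for (A relative to B) as a preference label.

    Assumption: positive values favor A, negative values favor B.
    """
    if cmos > margin:
        return "A"
    if cmos < -margin:
        return "B"
    return "tie"


if __name__ == "__main__":
    preds = {"mos": 3.5, "pesq": 2.0, "stoi": 0.9}
    refs = {"mos": 4.0, "pesq": None}  # only MOS is annotated here
    print(masked_multi_metric_loss(preds, refs))
    print(preference_from_cmos(0.7))
```

In this sketch a partially annotated sample still produces a training signal from the metrics it does have, which is the property the abstract highlights for multi-source datasets.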
