Bibliographic Details
Main Authors: Wang, Wei; Zhang, Wangyou; Li, Chenda; Wang, Jiahe; Cornell, Samuele; Sach, Marvin; Saijo, Kohei; Fu, Yihui; Ni, Zhaoheng; Han, Bing; Gong, Xun; Bi, Mengxiao; Fingscheidt, Tim; Watanabe, Shinji; Qian, Yanmin
Format: Preprint
Published: 2026
Subjects: Sound
Online Access: https://arxiv.org/abs/2601.18438
_version_ 1866917223699841024
author Wang, Wei
Zhang, Wangyou
Li, Chenda
Wang, Jiahe
Cornell, Samuele
Sach, Marvin
Saijo, Kohei
Fu, Yihui
Ni, Zhaoheng
Han, Bing
Gong, Xun
Bi, Mengxiao
Fingscheidt, Tim
Watanabe, Shinji
Qian, Yanmin
contents Automatic speech quality assessment has become increasingly important as modern speech generation systems continue to advance, while human listening tests remain costly, time-consuming, and difficult to scale. Most existing learning-based assessment models rely primarily on scarce human-annotated mean opinion score (MOS) data, which limits robustness and generalization, especially when training across heterogeneous datasets. In this work, we propose UrgentMOS, a unified speech quality assessment framework that jointly learns from diverse objective and perceptual quality metrics, while explicitly tolerating the absence of arbitrary subsets of metrics during training. By leveraging complementary quality facets under heterogeneous supervision, UrgentMOS enables effective utilization of partially annotated data and improves robustness when trained on large-scale, multi-source datasets. Beyond absolute score prediction, UrgentMOS explicitly models pairwise quality preferences by directly predicting comparative MOS (CMOS), making it well suited for preference-based evaluation scenarios commonly adopted in system benchmarking. Extensive experiments across a wide range of speech quality datasets, including simulated distortions, speech enhancement, and speech synthesis, demonstrate that UrgentMOS consistently achieves state-of-the-art performance in both absolute and comparative evaluation settings.
format Preprint
id arxiv_https___arxiv_org_abs_2601_18438
institution arXiv
publishDate 2026
record_format arxiv
title UrgentMOS: Unified Multi-Metric and Preference Learning for Robust Speech Quality Assessment
topic Sound
url https://arxiv.org/abs/2601.18438