Staff View: Rethinking Training Targets, Architectures and Data Quality for Universal Speech Enhancement :: Library Catalog

Bibliographic Details
Main Authors:	Fu, Szu-Wei; Chao, Rong; Yang, Xuesong; Huang, Sung-Feng; Zezario, Ryandhimas E.; Nasretdinov, Rauf; Jukić, Ante; Tsao, Yu; Wang, Yu-Chiang Frank
Format:	Preprint
Published:	2026
Subjects:	Sound
Online Access:	https://arxiv.org/abs/2603.02641

_version_	1866914523014758400
author	Fu, Szu-Wei; Chao, Rong; Yang, Xuesong; Huang, Sung-Feng; Zezario, Ryandhimas E.; Nasretdinov, Rauf; Jukić, Ante; Tsao, Yu; Wang, Yu-Chiang Frank
author_facet	Fu, Szu-Wei; Chao, Rong; Yang, Xuesong; Huang, Sung-Feng; Zezario, Ryandhimas E.; Nasretdinov, Rauf; Jukić, Ante; Tsao, Yu; Wang, Yu-Chiang Frank
contents	Universal Speech Enhancement (USE) aims to restore speech quality under diverse degradation conditions while preserving signal fidelity. Despite recent progress, key challenges in training target selection, the distortion-perception tradeoff, and data curation remain unresolved. In this work, we systematically address these three overlooked problems. First, we revisit the conventional practice of using early-reflected speech as the dereverberation target and show that it can degrade perceptual quality and downstream ASR performance. We instead demonstrate that time-shifted anechoic clean speech provides a superior learning target. Second, guided by the distortion-perception tradeoff theory, we propose a simple two-stage framework that achieves minimal distortion under a given level of perceptual quality. Third, we analyze the trade-off between training data scale and quality for USE, revealing that training on large uncurated corpora imposes a performance ceiling, as models struggle to remove subtle artifacts. Our method achieves state-of-the-art performance on the URGENT 2025 non-blind test set and exhibits strong language-agnostic generalization, making it effective for improving TTS training data. Model weights are available for download at: https://huggingface.co/nvidia/RE-USE.
format	Preprint
id	arxiv_https___arxiv_org_abs_2603_02641
institution	arXiv
publishDate	2026
record_format	arxiv
title	Rethinking Training Targets, Architectures and Data Quality for Universal Speech Enhancement
topic	Sound
url	https://arxiv.org/abs/2603.02641
