Staff View: Masked Autoencoders as Universal Speech Enhancer :: Library Catalog

Bibliographic Details
Main Authors:	Rajagopalan, Rajalaxmi; Giri, Ritwik; Tang, Zhiqiang; Han, Kyu
Format:	Preprint
Published:	2026
Subjects:	Sound; Machine Learning
Online Access:	https://arxiv.org/abs/2602.02413

_version_	1866912868890312704
author	Rajagopalan, Rajalaxmi; Giri, Ritwik; Tang, Zhiqiang; Han, Kyu
author_facet	Rajagopalan, Rajalaxmi; Giri, Ritwik; Tang, Zhiqiang; Han, Kyu
contents	Supervised speech enhancement methods have been very successful. In practical scenarios, however, clean speech is scarce, which motivates self-supervised learning (SSL) speech enhancement methods that offer comparable enhancement performance and can also be applied to other speech-related downstream applications. In this work, we develop a masked-autoencoder-based universal speech enhancer that is agnostic to the type of distortion affecting speech, can handle multiple distortions simultaneously, and is trained in a self-supervised manner. An augmentation stack adds further distortions to the noisy input data. During pre-training, the masked autoencoder learns to remove the added distortions while reconstructing the masked regions of the spectrogram. The pre-trained embeddings are then used by fine-tuning models trained on a small amount of paired data for specific downstream tasks. We evaluate the pre-trained features on denoising and dereverberation downstream tasks. We explore different augmentations (such as single- or multi-speaker) in the pre-training augmentation stack, and the effect of different noisy input feature representations (such as log1p compression, $\log(1+x)$) on the pre-trained embeddings and on downstream fine-tuning enhancement performance. We show that the proposed method not only outperforms the baseline but also achieves state-of-the-art performance on both in-domain and out-of-domain evaluation datasets.
format	Preprint
id	arxiv_https___arxiv_org_abs_2602_02413
institution	arXiv
publishDate	2026
record_format	arxiv
title	Masked Autoencoders as Universal Speech Enhancer
topic	Sound; Machine Learning
url	https://arxiv.org/abs/2602.02413

Similar Items