Staff View: :: Library Catalog

Saved in:

Bibliographic Details
Main Authors:	Maciejewski, Matthew, Cornell, Samuele
Format:	Preprint
Published:	2026
Subjects:	Audio and Speech Processing
Online Access:	https://arxiv.org/abs/2604.08415
Tags:	Add Tag No Tags, Be the first to tag this record!

_version_	1866908949654011904
author	Maciejewski, Matthew Cornell, Samuele
author_facet	Maciejewski, Matthew Cornell, Samuele
contents	Noisy speech separation systems are typically trained on fully-synthetic mixtures, limiting generalization to real-world scenarios. Though training on mixtures of in-domain (thus often noisy) speech is possible, we show that this leads to undesirable optima where mixture noise is retained in the estimates, due to the inseparability of the background noises and the loss function's symmetry. To address this, we propose ring mixing, a batch strategy of using each source in two mixtures, alongside a new Signal-to-Consistency-Error Ratio (SCER) auxiliary loss penalizing inconsistent estimates of the same source from different mixtures, breaking symmetry and incentivizing denoising. On a WHAM!-based benchmark, our method can reduce residual noise by upwards of half, effectively learning to denoise from only noisy recordings. This opens the door to training more generalizable systems using in-the-wild data, which we demonstrate via systems trained using naturally-noisy speech from VoxCeleb.
format	Preprint
id	arxiv_https___arxiv_org_abs_2604_08415
institution	arXiv
publishDate	2026
record_format	arxiv
spellingShingle	Ring Mixing with Auxiliary Signal-to-Consistency-Error Ratio Loss for Unsupervised Denoising in Speech Separation Maciejewski, Matthew Cornell, Samuele Audio and Speech Processing Noisy speech separation systems are typically trained on fully-synthetic mixtures, limiting generalization to real-world scenarios. Though training on mixtures of in-domain (thus often noisy) speech is possible, we show that this leads to undesirable optima where mixture noise is retained in the estimates, due to the inseparability of the background noises and the loss function's symmetry. To address this, we propose ring mixing, a batch strategy of using each source in two mixtures, alongside a new Signal-to-Consistency-Error Ratio (SCER) auxiliary loss penalizing inconsistent estimates of the same source from different mixtures, breaking symmetry and incentivizing denoising. On a WHAM!-based benchmark, our method can reduce residual noise by upwards of half, effectively learning to denoise from only noisy recordings. This opens the door to training more generalizable systems using in-the-wild data, which we demonstrate via systems trained using naturally-noisy speech from VoxCeleb.
title	Ring Mixing with Auxiliary Signal-to-Consistency-Error Ratio Loss for Unsupervised Denoising in Speech Separation
topic	Audio and Speech Processing
url	https://arxiv.org/abs/2604.08415

Similar Items