MARC View :: Library Catalog

Bibliographic Details
Main Authors:	Richter, Julius; Masuyama, Yoshiki; Boeddeker, Christoph; Edo, Takahiro; Wichern, Gordon; Le Roux, Jonathan
Format:	Preprint
Published:	2026
Subjects:	Audio and Speech Processing; Machine Learning
Online Access:	https://arxiv.org/abs/2605.06189

_version_	1866911657290104832
author	Richter, Julius; Masuyama, Yoshiki; Boeddeker, Christoph; Edo, Takahiro; Wichern, Gordon; Le Roux, Jonathan
author_facet	Richter, Julius; Masuyama, Yoshiki; Boeddeker, Christoph; Edo, Takahiro; Wichern, Gordon; Le Roux, Jonathan
contents	We propose a plug-and-play framework for speech enhancement and separation that augments predictive methods with a generative speech prior. Our approach, termed Stochastic Interpolant Prior for Speech (SIPS), builds on stochastic interpolants and leverages their flexibility to bridge predictive and generative modeling. Specifically, we decompose the interpolation dynamics into a task-specific drift and a stochastic denoising component, allowing a predictive estimate to be integrated directly into the generative sampling process. This results in a mathematically grounded framework for combining strong pretrained predictors with the expressive power of generative models. To this end, we train a score model using only clean speech, yielding a degradation-agnostic prior that can be reused across tasks. During inference, the predictor provides a deterministic drift that steers the sampling process toward a task-consistent estimate, while the score model preserves perceptual naturalness. Unlike prior hybrid approaches, which typically rely on architecture-specific conditioning and are tied to particular predictors or degradation settings, SIPS provides a unified framework that generalizes across predictors and additive degradation tasks. We demonstrate its effectiveness for both speech enhancement and speech separation using recent predictors such as SEMamba and FlexIO. The proposed method consistently improves perceptual quality, achieving gains of up to +1.0 NISQA for speech separation.
format	Preprint
id	arxiv_https___arxiv_org_abs_2605_06189
institution	arXiv
publishDate	2026
record_format	arxiv
title	Predictive-Generative Drift Decomposition for Speech Enhancement and Separation
topic	Audio and Speech Processing; Machine Learning
url	https://arxiv.org/abs/2605.06189