Bibliographic Details
Main Authors: Richter, Julius; Masuyama, Yoshiki; Boeddeker, Christoph; Edo, Takahiro; Wichern, Gordon; Roux, Jonathan Le
Format: Preprint
Published: 2026
Subjects: Audio and Speech Processing; Machine Learning
Online Access: https://arxiv.org/abs/2605.06189
author Richter, Julius
Masuyama, Yoshiki
Boeddeker, Christoph
Edo, Takahiro
Wichern, Gordon
Roux, Jonathan Le
contents We propose a plug-and-play framework for speech enhancement and separation that augments predictive methods with a generative speech prior. Our approach, termed Stochastic Interpolant Prior for Speech (SIPS), builds on stochastic interpolants and leverages their flexibility to bridge predictive and generative modeling. Specifically, we decompose the interpolation dynamics into a task-specific drift and a stochastic denoising component, allowing a predictive estimate to be integrated directly into the generative sampling process. This results in a mathematically grounded framework for combining strong pretrained predictors with the expressive power of generative models. To this end, we train a score model using only clean speech, yielding a degradation-agnostic prior that can be reused across tasks. During inference, the predictor provides a deterministic drift that steers the sampling process toward a task-consistent estimate, while the score model preserves perceptual naturalness. Unlike prior hybrid approaches, which typically rely on architecture-specific conditioning and are tied to particular predictors or degradation settings, SIPS provides a unified framework that generalizes across predictors and additive degradation tasks. We demonstrate its effectiveness for both speech enhancement and speech separation using recent predictors such as SEMamba and FlexIO. The proposed method consistently improves perceptual quality, achieving gains of up to +1.0 NISQA for speech separation.
format Preprint
id arxiv_https___arxiv_org_abs_2605_06189
institution arXiv
publishDate 2026
record_format arxiv
title Predictive-Generative Drift Decomposition for Speech Enhancement and Separation
topic Audio and Speech Processing
Machine Learning
url https://arxiv.org/abs/2605.06189
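Illustrative note: the abstract describes a sampling process whose update splits into a task-specific drift supplied by a predictor and a stochastic denoising component supplied by a score model. The toy sketch below only illustrates that structure with an Euler-Maruyama loop; it is not the SIPS formulation, whose equations are not given in this record. Everything in it is an assumption made for illustration: the function names `gaussian_score` and `sample`, the analytic Gaussian score standing in for a learned score model, the bridge-style pull toward the predictor estimate, the linear noise schedule, and the made-up input values.

```python
import numpy as np


def gaussian_score(x, sigma, mu=0.0, s=1.0):
    """Score of N(mu, s^2 + sigma^2), a toy stand-in for a learned score model."""
    return -(x - mu) / (s**2 + sigma**2)


def sample(y_pred, x_init, n_steps=200, sigma_max=0.5, seed=0):
    """Euler-Maruyama loop with a task drift plus a score-based denoising term."""
    rng = np.random.default_rng(seed)
    y_pred = np.asarray(y_pred, dtype=float)
    x = np.array(x_init, dtype=float)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        g = sigma_max * (1.0 - t)  # hypothetical noise schedule, decays toward 0
        # Task-specific drift: bridge-style pull toward the predictor's estimate.
        # (y_pred - x) / (1 - t) * dt simplifies to (y_pred - x) / (n_steps - i).
        task_step = (y_pred - x) / (n_steps - i)
        # Stochastic denoising component: score drift plus injected noise.
        score_step = g**2 * gaussian_score(x, g) * dt
        noise_step = g * np.sqrt(dt) * rng.standard_normal(x.shape)
        x = x + task_step + score_step + noise_step
    return x


y_pred = np.array([0.3, -1.2, 0.8])  # made-up predictor output
x_init = np.array([1.0, 0.0, -0.5])  # made-up starting point
x_hat = sample(y_pred, x_init)
```

With `sigma_max=0.0` the loop reduces to the deterministic drift alone and lands on `y_pred`; with a nonzero schedule the score and noise terms perturb the trajectory along the way while the result still ends close to the predictor estimate. In a real system the analytic score would be replaced by a network trained on clean speech, as the abstract states.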