Staff View :: Library Catalog

Bibliographic Details
Main Authors:	Moon, Junwon; Choi, Hyunjin; Park, Hansol; Kim, Heeseung; Shim, Kyuhong
Format:	Preprint
Published:	2026
Subjects:	Sound; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2603.12837

_version_	1866914390856433664
author	Moon, Junwon; Choi, Hyunjin; Park, Hansol; Kim, Heeseung; Shim, Kyuhong
author_facet	Moon, Junwon; Choi, Hyunjin; Park, Hansol; Kim, Heeseung; Shim, Kyuhong
contents	Target speaker extraction (TSE) extracts the target speaker's voice from overlapping speech mixtures given a reference utterance. Existing approaches typically fall into two categories: discriminative and generative. Discriminative methods apply time-frequency masking for fast inference but often over-suppress the target signal, while generative methods synthesize high-quality speech at the cost of numerous iterative steps. We propose Mask2Flow-TSE, a two-stage framework combining the strengths of both paradigms. The first stage applies discriminative masking for coarse separation, and the second stage employs flow matching to refine the output toward target speech. Unlike generative approaches that synthesize speech from Gaussian noise, our method starts from the masked spectrogram, enabling high-quality reconstruction in a single inference step. Experiments show that Mask2Flow-TSE achieves comparable performance to existing generative TSE methods with approximately 85M parameters.
format	Preprint
id	arxiv_https___arxiv_org_abs_2603_12837
institution	arXiv
publishDate	2026
record_format	arxiv
title	Mask2Flow-TSE: Two-Stage Target Speaker Extraction with Masking and Flow Matching
topic	Sound; Artificial Intelligence
url	https://arxiv.org/abs/2603.12837
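
The two-stage idea summarized in the contents field (coarse masking, then a single flow-matching step that starts from the masked spectrogram instead of Gaussian noise) might be sketched roughly as follows. This is a toy illustration only, not the authors' implementation: the names apply_mask, one_step_refine, and toy_velocity are hypothetical, the straight-line probability path is an assumption the abstract does not confirm, and the velocity function cheats by using the known target where a trained network would produce an estimate.

```python
import numpy as np


def apply_mask(mixture_mag, mask):
    """Stage 1 (hypothetical helper): time-frequency masking of a magnitude spectrogram."""
    return mixture_mag * np.clip(mask, 0.0, 1.0)


def one_step_refine(x0, velocity_fn):
    """Stage 2 (hypothetical helper): one Euler step of dx/dt = v(x, t) from t=0 to t=1."""
    step_size = 1.0
    return x0 + step_size * velocity_fn(x0, 0.0)


# Toy data: random "spectrograms" with 4 frequency bins and 6 frames.
rng = np.random.default_rng(0)
target = rng.random((4, 6))
interferer = rng.random((4, 6))
mixture = target + interferer

# An oracle ratio mask, deliberately scaled by 0.8 to imitate over-suppression.
mask = 0.8 * target / (mixture + 1e-8)
coarse = apply_mask(mixture, mask)


def toy_velocity(x, t):
    # ASSUMPTION: straight-line path x_t = (1 - t) * x0 + t * x1, whose velocity
    # is x1 - x0. A real model would predict this; here the target is used directly.
    return target - x


refined = one_step_refine(coarse, toy_velocity)

print("coarse error :", np.abs(coarse - target).mean())
print("refined error:", np.abs(refined - target).mean())
```

In this toy setting the refinement recovers the target only because the velocity is an oracle. The point of the sketch is the structure: the starting point of the flow is the stage-one output, so a single step covers a much shorter distance than a path that begins at noise.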
