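Staff View: FNSE-SBGAN: Far-field Speech Enhancement with Schrodinger Bridge and Generative Adversarial Networks :: Library Catalog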

Bibliographic Details
Main Authors:	Lei, Tong; Hu, Qinwen; Lin, Ziyao; Li, Andong; Chen, Rilin; Yu, Meng; Yu, Dong; Lu, Jing
Format:	Preprint
Published:	2025
Subjects:	Audio and Speech Processing
Online Access:	https://arxiv.org/abs/2503.12936

_version_	1866909579798904832
author	Lei, Tong; Hu, Qinwen; Lin, Ziyao; Li, Andong; Chen, Rilin; Yu, Meng; Yu, Dong; Lu, Jing
author_facet	Lei, Tong; Hu, Qinwen; Lin, Ziyao; Li, Andong; Chen, Rilin; Yu, Meng; Yu, Dong; Lu, Jing
contents	The prevailing method for neural speech enhancement predominantly utilizes fully-supervised deep learning with simulated pairs of far-field noisy-reverberant speech and clean speech. Nonetheless, these models frequently demonstrate restricted generalizability to mixtures recorded in real-world conditions. To address this issue, this study investigates training enhancement models directly on real mixtures. Specifically, we revisit the single-channel far-field to near-field speech enhancement (FNSE) task, focusing on real-world data characterized by low signal-to-noise ratio (SNR), high reverberation, and mid-to-high frequency attenuation. We propose FNSE-SBGAN, a framework that integrates a Schrodinger Bridge (SB)-based diffusion model with generative adversarial networks (GANs). Our approach achieves state-of-the-art performance across various metrics and subjective evaluations, significantly reducing the character error rate (CER) by up to 14.58% compared to far-field signals. Experimental results demonstrate that FNSE-SBGAN preserves superior subjective quality and establishes a new benchmark for real-world far-field speech enhancement. Additionally, we introduce an evaluation framework leveraging matrix rank analysis in the time-frequency domain, providing systematic insights into model performance and revealing the strengths and weaknesses of different generative methods.
format	Preprint
id	arxiv_https___arxiv_org_abs_2503_12936
institution	arXiv
publishDate	2025
record_format	arxiv
title	FNSE-SBGAN: Far-field Speech Enhancement with Schrodinger Bridge and Generative Adversarial Networks
topic	Audio and Speech Processing
url	https://arxiv.org/abs/2503.12936

Similar Items