Bibliographic Details
Main Authors: Lei, Tong; Hu, Qinwen; Lin, Ziyao; Li, Andong; Chen, Rilin; Yu, Meng; Yu, Dong; Lu, Jing
Format: Preprint
Published: 2025
Subjects: Audio and Speech Processing
Online Access: https://arxiv.org/abs/2503.12936
_version_ 1866909579798904832
author Lei, Tong
Hu, Qinwen
Lin, Ziyao
Li, Andong
Chen, Rilin
Yu, Meng
Yu, Dong
Lu, Jing
contents The prevailing method for neural speech enhancement predominantly utilizes fully-supervised deep learning with simulated pairs of far-field noisy-reverberant speech and clean speech. Nonetheless, these models frequently demonstrate restricted generalizability to mixtures recorded in real-world conditions. To address this issue, this study investigates training enhancement models directly on real mixtures. Specifically, we revisit the single-channel far-field to near-field speech enhancement (FNSE) task, focusing on real-world data characterized by low signal-to-noise ratio (SNR), high reverberation, and mid-to-high frequency attenuation. We propose FNSE-SBGAN, a framework that integrates a Schrodinger Bridge (SB)-based diffusion model with generative adversarial networks (GANs). Our approach achieves state-of-the-art performance across various metrics and subjective evaluations, significantly reducing the character error rate (CER) by up to 14.58% compared to far-field signals. Experimental results demonstrate that FNSE-SBGAN preserves superior subjective quality and establishes a new benchmark for real-world far-field speech enhancement. Additionally, we introduce an evaluation framework leveraging matrix rank analysis in the time-frequency domain, providing systematic insights into model performance and revealing the strengths and weaknesses of different generative methods.
format Preprint
id arxiv_https___arxiv_org_abs_2503_12936
institution arXiv
publishDate 2025
record_format arxiv
title FNSE-SBGAN: Far-field Speech Enhancement with Schrodinger Bridge and Generative Adversarial Networks
topic Audio and Speech Processing
url https://arxiv.org/abs/2503.12936