Bibliographic Details
Main Authors: Li, Yue, Hindriks, Koen V., Kunneman, Florian A.
Format: Preprint
Published: 2024
Online Access: https://arxiv.org/abs/2409.06274
_version_ 1866910597555159040
author Li, Yue
Hindriks, Koen V.
Kunneman, Florian A.
contents Spectral subtraction, widely used for its simplicity, has been employed to address the Robot Ego Speech Filtering (RESF) problem: detecting the speech content of a human interruption in a robot's single-channel microphone recordings while the robot itself is speaking. However, this approach suffers from oversubtraction in the fundamental frequency range (FFR), which degrades speech content recognition. To address this, we propose a Two-Mask Conformer-based Metric Generative Adversarial Network (CMGAN) to enhance the detected speech and improve recognition results. Our model compensates for oversubtracted FFR values using high-frequency information and long-term features, and then denoises the resulting spectrogram. In addition, we introduce an incremental processing method that allows semi-real-time processing of streaming audio input with a network trained on long fixed-length input. Evaluations on two datasets, including one with unseen noise, demonstrate significant improvements in recognition accuracy and confirm the effectiveness of the proposed two-mask approach and incremental processing, enhancing the robustness of the proposed RESF pipeline in real-world human-robot interaction (HRI) scenarios.
format Preprint
id arxiv_https___arxiv_org_abs_2409_06274
institution arXiv
publishDate 2024
record_format arxiv
title Spectral oversubtraction? An approach for speech enhancement after robot ego speech filtering in semi-real-time
topic Robotics
Sound
Audio and Speech Processing
68T50
url https://arxiv.org/abs/2409.06274