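Staff View: Unsupervised Face-Masked Speech Enhancement Using Generative Adversarial Networks With Human-in-the-Loop Assessment Metrics :: Library Catalog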

Bibliographic Details
Main Authors:	Wang, Syu-Siang; Chen, Jia-Yang; Bai, Bo-Ren; Fang, Shih-Hau; Tsao, Yu
Format:	Preprint
Published:	2024
Subjects:	Audio and Speech Processing; Signal Processing
Online Access:	https://arxiv.org/abs/2407.01939

_version_	1866911963394605056
author	Wang, Syu-Siang; Chen, Jia-Yang; Bai, Bo-Ren; Fang, Shih-Hau; Tsao, Yu
author_facet	Wang, Syu-Siang; Chen, Jia-Yang; Bai, Bo-Ren; Fang, Shih-Hau; Tsao, Yu
contents	The use of face masks is an essential healthcare measure, particularly during pandemics, yet it can hinder everyday communication. To address this problem, we propose the human-in-the-loop StarGAN (HL-StarGAN) face-masked speech enhancement method. HL-StarGAN comprises a discriminator, a classifier, a metric assessment predictor, and a generator that leverages an attention mechanism. The metric assessment predictor, referred to as MaskQSS, incorporates human participants in its development and serves as a "human-in-the-loop" module during the learning process of HL-StarGAN. The overall HL-StarGAN model was trained using an unsupervised learning strategy that simultaneously focuses on the reconstruction of the original clean speech and the optimization of human perception. To implement HL-StarGAN, we curated a face-masked speech database named "FMVD," which comprises recordings from 34 speakers in three distinct face-masked scenarios and a clean condition. We conducted subjective and objective tests on the proposed HL-StarGAN using this database. The test results are as follows: (1) MaskQSS successfully predicted the quality scores of face mask voices, outperforming several existing speech assessment methods. (2) The integration of the MaskQSS predictor enhanced the ability of HL-StarGAN to transform face mask voices into high-quality speech; this enhancement is evident in both objective and subjective tests, in which HL-StarGAN outperforms conventional StarGAN and CycleGAN-based systems.
format	Preprint
id	arxiv_https___arxiv_org_abs_2407_01939
institution	arXiv
publishDate	2024
record_format	arxiv
title	Unsupervised Face-Masked Speech Enhancement Using Generative Adversarial Networks With Human-in-the-Loop Assessment Metrics
topic	Audio and Speech Processing; Signal Processing
url	https://arxiv.org/abs/2407.01939

Similar Items