Staff View: Diff-ETS: Learning a Diffusion Probabilistic Model for Electromyography-to-Speech Conversion :: Library Catalog

Bibliographic Details
Main Authors:	Ren, Zhao; Scheck, Kevin; Hou, Qinhan; van Gogh, Stefano; Wand, Michael; Schultz, Tanja
Format:	Preprint
Published:	2024
Subjects:	Sound; Audio and Speech Processing
Online Access:	https://arxiv.org/abs/2405.08021

_version_	1866913348934696960
author	Ren, Zhao; Scheck, Kevin; Hou, Qinhan; van Gogh, Stefano; Wand, Michael; Schultz, Tanja
author_facet	Ren, Zhao; Scheck, Kevin; Hou, Qinhan; van Gogh, Stefano; Wand, Michael; Schultz, Tanja
contents	Electromyography-to-Speech (ETS) conversion has demonstrated its potential for silent speech interfaces by generating audible speech from Electromyography (EMG) signals during silent articulations. ETS models usually consist of an EMG encoder which converts EMG signals to acoustic speech features, and a vocoder which then synthesises the speech signals. Due to an inadequate amount of available data and noisy signals, the synthesised speech often exhibits a low level of naturalness. In this work, we propose Diff-ETS, an ETS model which uses a score-based diffusion probabilistic model to enhance the naturalness of synthesised speech. The diffusion model is applied to improve the quality of the acoustic features predicted by an EMG encoder. In our experiments, we evaluated fine-tuning the diffusion model on predictions of a pre-trained EMG encoder, and training both models in an end-to-end fashion. We compared Diff-ETS with a baseline ETS model without diffusion using objective metrics and a listening test. The results indicated the proposed Diff-ETS significantly improved speech naturalness over the baseline.
format	Preprint
id	arxiv_https___arxiv_org_abs_2405_08021
institution	arXiv
publishDate	2024
record_format	arxiv
title	Diff-ETS: Learning a Diffusion Probabilistic Model for Electromyography-to-Speech Conversion
topic	Sound; Audio and Speech Processing
url	https://arxiv.org/abs/2405.08021

Similar Items