| Main Authors: | Elyaderani, Mahsa Kadkhodaei; Shirani, Shahram |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | Sound; Artificial Intelligence; Machine Learning; Multimedia; Audio and Speech Processing |
| Online Access: | https://arxiv.org/abs/2406.01321 |
| _version_ | 1866911900627894272 |
|---|---|
| author | Elyaderani, Mahsa Kadkhodaei; Shirani, Shahram |
| author_facet | Elyaderani, Mahsa Kadkhodaei; Shirani, Shahram |
| contents | Speech in-painting is the task of regenerating missing audio content using reliable context information. Despite various recent studies in multi-modal perception of audio in-painting, there is still a need for an effective infusion of visual and auditory information in speech in-painting. In this paper, we introduce a novel sequence-to-sequence model that leverages visual information to in-paint audio signals via an encoder-decoder architecture. The encoder acts as a lip-reader for facial recordings, and the decoder takes both the encoder outputs and the distorted audio spectrograms to restore the original speech. Our model outperforms an audio-only speech in-painting model and achieves results comparable to a recent multi-modal speech in-painter in terms of speech quality and intelligibility metrics for distortions of 300 ms to 1500 ms duration, which demonstrates the effectiveness of the introduced multi-modality in speech in-painting. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2406_01321 |
| institution | arXiv |
| publishDate | 2024 |
| record_format | arxiv |
| title | Sequence-to-Sequence Multi-Modal Speech In-Painting |
| topic | Sound; Artificial Intelligence; Machine Learning; Multimedia; Audio and Speech Processing |
| url | https://arxiv.org/abs/2406.01321 |