Staff View: MM-KWS: Multi-modal Prompts for Multilingual User-defined Keyword Spotting :: Library Catalog

Bibliographic Details
Main Authors:	Ai, Zhiqi; Chen, Zhiyong; Xu, Shugong
Format:	Preprint
Published:	2024
Subjects:	Audio and Speech Processing; Computation and Language; Sound
Online Access:	https://arxiv.org/abs/2406.07310

_version_	1866917690491273216
author	Ai, Zhiqi; Chen, Zhiyong; Xu, Shugong
author_facet	Ai, Zhiqi; Chen, Zhiyong; Xu, Shugong
contents	In this paper, we propose MM-KWS, a novel approach to user-defined keyword spotting leveraging multi-modal enrollments of text and speech templates. Unlike previous methods that focus solely on either text or speech features, MM-KWS extracts phoneme, text, and speech embeddings from both modalities. These embeddings are then compared with the query speech embedding to detect the target keywords. To ensure the applicability of MM-KWS across diverse languages, we utilize a feature extractor incorporating several multilingual pre-trained models. Subsequently, we validate its effectiveness on Mandarin and English tasks. In addition, we have integrated advanced data augmentation tools for hard case mining to enhance MM-KWS in distinguishing confusable words. Experimental results on the LibriPhrase and WenetPhrase datasets demonstrate that MM-KWS outperforms prior methods significantly.
format	Preprint
id	arxiv_https___arxiv_org_abs_2406_07310
institution	arXiv
publishDate	2024
record_format	arxiv
spellingShingle	MM-KWS: Multi-modal Prompts for Multilingual User-defined Keyword Spotting Ai, Zhiqi Chen, Zhiyong Xu, Shugong Audio and Speech Processing Computation and Language Sound In this paper, we propose MM-KWS, a novel approach to user-defined keyword spotting leveraging multi-modal enrollments of text and speech templates. Unlike previous methods that focus solely on either text or speech features, MM-KWS extracts phoneme, text, and speech embeddings from both modalities. These embeddings are then compared with the query speech embedding to detect the target keywords. To ensure the applicability of MM-KWS across diverse languages, we utilize a feature extractor incorporating several multilingual pre-trained models. Subsequently, we validate its effectiveness on Mandarin and English tasks. In addition, we have integrated advanced data augmentation tools for hard case mining to enhance MM-KWS in distinguishing confusable words. Experimental results on the LibriPhrase and WenetPhrase datasets demonstrate that MM-KWS outperforms prior methods significantly.
title	MM-KWS: Multi-modal Prompts for Multilingual User-defined Keyword Spotting
topic	Audio and Speech Processing; Computation and Language; Sound
url	https://arxiv.org/abs/2406.07310
