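Staff View: MALEFA: Multi-grAnularity Learning and Effective False Alarm Suppression for Zero-shot Keyword Spotting :: Library Catalog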

Bibliographic Details
Main Authors:	Li, Lo-Ya; Lo, Tien-Hong; Hung, Jeih-Weih; Huang, Shih-Chieh; Chen, Berlin
Format:	Preprint
Published:	2026
Subjects:	Audio and Speech Processing
Online Access:	https://arxiv.org/abs/2604.03689

_version_	1866910102999531520
author	Li, Lo-Ya; Lo, Tien-Hong; Hung, Jeih-Weih; Huang, Shih-Chieh; Chen, Berlin
author_facet	Li, Lo-Ya; Lo, Tien-Hong; Hung, Jeih-Weih; Huang, Shih-Chieh; Chen, Berlin
contents	User-defined keyword spotting (KWS) without resorting to domain-specific pre-labeled training data is of fundamental importance in building adaptable and personalized voice interfaces. However, such systems still face significant challenges, including constrained computational resources and limited annotated training data. Existing methods also struggle to distinguish acoustically similar keywords, often leading to a high false alarm rate (FAR) in real-world deployments. To mitigate these limitations, we put forward MALEFA, a novel lightweight zero-shot KWS framework that jointly learns utterance- and phoneme-level alignments via cross-attention and a multi-granularity contrastive learning objective. Evaluations on four public benchmark datasets show that MALEFA achieves a high accuracy of 90%, significantly reducing FAR to 0.007% on the AMI dataset. Beyond its strong performance, MALEFA demonstrates high computational efficiency and can readily support real-time deployment on resource-constrained devices.
format	Preprint
id	arxiv_https___arxiv_org_abs_2604_03689
institution	arXiv
publishDate	2026
record_format	arxiv
title	MALEFA: Multi-grAnularity Learning and Effective False Alarm Suppression for Zero-shot Keyword Spotting
topic	Audio and Speech Processing
url	https://arxiv.org/abs/2604.03689
