Bibliographic Details
Main Authors: Fatemeh, Allahdadi; Rahil, Mahdian Toroghi; Hassan, Zareian
Format: Preprint
Published: 2024
Subjects: Machine Learning; Sound; Audio and Speech Processing
Online Access: https://arxiv.org/abs/2410.04091
author Fatemeh, Allahdadi
Rahil, Mahdian Toroghi
Hassan, Zareian
contents Query-by-example spoken term detection (QbE-STD) is typically constrained by the scarcity of transcribed data and by language specificity. This paper introduces a novel, language-agnostic QbE-STD model that combines image processing techniques with a transformer architecture. Using a pre-trained XLSR-53 network for feature extraction and a Hough transform for detection, the model searches for user-defined spoken terms within any audio file. Experimental results across four languages show performance gains of 19-54% over a CNN-based baseline. The model is faster than dynamic time warping (DTW), but its accuracy remains lower than DTW's. Notably, the model can also accurately count repetitions of the query term within the target audio.
format Preprint
id arxiv_https___arxiv_org_abs_2410_04091
institution arXiv
publishDate 2024
record_format arxiv
title Cross-Lingual Query-by-Example Spoken Term Detection: A Transformer-Based Approach
topic Machine Learning
Sound
Audio and Speech Processing
url https://arxiv.org/abs/2410.04091