| Main Authors: | Fatemeh, Allahdadi; Rahil, Mahdian Toroghi; Hassan, Zareian |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | Machine Learning; Sound; Audio and Speech Processing |
| Online Access: | https://arxiv.org/abs/2410.04091 |
| _version_ | 1866914965627076608 |
|---|---|
| author | Fatemeh, Allahdadi; Rahil, Mahdian Toroghi; Hassan, Zareian |
| author_facet | Fatemeh, Allahdadi; Rahil, Mahdian Toroghi; Hassan, Zareian |
| contents | Query-by-example spoken term detection (QbE-STD) is typically constrained by the scarcity of transcribed data and by language specificity. This paper introduces a novel, language-agnostic QbE-STD model that combines image processing techniques with a transformer architecture. By employing a pre-trained XLSR-53 network for feature extraction and a Hough transform for detection, the model searches for user-defined spoken terms within any audio file. Experimental results across four languages demonstrate significant performance gains (19-54%) over a CNN-based baseline. Although processing time is improved compared to dynamic time warping (DTW), accuracy remains lower than that of DTW. Notably, the model offers the advantage of accurately counting repetitions of the query term within the target audio. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2410_04091 |
| institution | arXiv |
| publishDate | 2024 |
| record_format | arxiv |
| title | Cross-Lingual Query-by-Example Spoken Term Detection: A Transformer-Based Approach |
| topic | Machine Learning; Sound; Audio and Speech Processing |
| url | https://arxiv.org/abs/2410.04091 |
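
The abstract describes detection as an image processing problem: features are extracted from the query and the audio, and a Hough transform locates the query inside the audio. The sketch below is only an illustration of that general idea, not the authors' implementation. It assumes that frame-level feature vectors are already available (in the paper they would come from XLSR-53; here they are tiny hand-made vectors), builds a binary query-versus-audio similarity image, and uses a simplified Hough-style vote restricted to 45-degree lines, where each diagonal corresponds to one candidate start position. All function names, the cosine threshold, and the `min_fraction` parameter are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def similarity_image(query, audio, threshold=0.9):
    """Binary image: rows index query frames, columns index audio frames."""
    return [[1 if cosine(q, a) >= threshold else 0 for a in audio]
            for q in query]


def diagonal_votes(image):
    """Hough-style accumulator restricted to 45-degree lines.

    Each bright pixel (i, j) votes for the diagonal with offset j - i,
    i.e. for the hypothesis that the query starts at audio frame j - i.
    """
    votes = {}
    for i, row in enumerate(image):
        for j, pixel in enumerate(row):
            if pixel:
                votes[j - i] = votes.get(j - i, 0) + 1
    return votes


def detect_occurrences(query, audio, threshold=0.9, min_fraction=0.8):
    """Return start frames whose diagonal covers enough of the query."""
    votes = diagonal_votes(similarity_image(query, audio, threshold))
    needed = min_fraction * len(query)
    return sorted(offset for offset, count in votes.items() if count >= needed)


# Toy data (assumption): three query frames, repeated twice in the audio.
noise = [-1.0, 0.0]
query = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
audio = [noise, noise] + query + [noise] + query

starts = detect_occurrences(query, audio)
print(starts)       # [2, 6]
print(len(starts))  # 2 occurrences of the query
```

Counting the detected diagonals gives the number of repetitions, which mirrors the counting capability mentioned in the abstract. On real speech, neighbouring offsets would typically also collect votes and the diagonals would not be exactly 45 degrees because speaking rates differ, so a full Hough transform over several angles plus peak suppression would be needed.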