Staff View: H-QuEST: Accelerating Query-by-Example Spoken Term Detection with Hierarchical Indexing

Bibliographic Details
Main Authors:	Singh, Akanksha; Chen, Yi-Ping Phoebe; Arora, Vipul
Format:	Preprint
Published:	2025
Subjects:	Audio and Speech Processing
Online Access:	https://arxiv.org/abs/2506.16751

_version_	1866912442395656192
author	Singh, Akanksha; Chen, Yi-Ping Phoebe; Arora, Vipul
author_facet	Singh, Akanksha; Chen, Yi-Ping Phoebe; Arora, Vipul
contents	Query-by-example spoken term detection (QbE-STD) searches for matching words or phrases in an audio dataset using a sample spoken query. When annotated data is limited or unavailable, QbE-STD is often done using template matching methods like dynamic time warping (DTW), which are computationally expensive and do not scale well. To address this, we propose H-QuEST (Hierarchical Query-by-Example Spoken Term Detection), a novel framework that accelerates spoken term retrieval by utilizing Term Frequency and Inverse Document Frequency (TF-IDF)-based sparse representations obtained through advanced audio representation learning techniques and Hierarchical Navigable Small World (HNSW) indexing with further refinement. Experimental results show that H-QuEST delivers substantial improvements in retrieval speed without sacrificing accuracy compared to existing methods.
format	Preprint
id	arxiv_https___arxiv_org_abs_2506_16751
institution	arXiv
publishDate	2025
record_format	arxiv
title	H-QuEST: Accelerating Query-by-Example Spoken Term Detection with Hierarchical Indexing
topic	Audio and Speech Processing
url	https://arxiv.org/abs/2506.16751
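
The retrieval idea summarized in the contents field (TF-IDF sparse representations searched through an index) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration, not the authors' implementation: each utterance is assumed to be already converted into a sequence of discrete acoustic unit IDs (the integers below are made up), the smoothed IDF weighting is one common choice rather than the one the preprint uses, and a brute-force cosine scan stands in for the HNSW index, which would replace the linear loop in `search` with an approximate nearest-neighbor lookup. The refinement stage mentioned in the record is not modeled.

```python
import math
from collections import Counter


def weigh(tokens, idf):
    """Turn a token sequence into an L2-normalized sparse TF-IDF vector (dict)."""
    counts = Counter(tokens)
    total = len(tokens)
    # Tokens never seen in the indexed collection are ignored.
    vec = {t: (c / total) * idf[t] for t, c in counts.items() if t in idf}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def tfidf_vectors(docs):
    """Fit IDF weights on the collection and return (vectors, idf)."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # document frequency: count each token once per doc
    # Smoothed IDF (an assumed weighting, not taken from the source).
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return [weigh(doc, idf) for doc in docs], idf


def cosine(a, b):
    """Cosine similarity of two unit-length sparse vectors."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def search(query_tokens, vectors, idf, k=2):
    """Brute-force top-k search; an HNSW index would replace this linear scan."""
    q = weigh(query_tokens, idf)
    scored = sorted(((cosine(q, v), i) for i, v in enumerate(vectors)), reverse=True)
    return [(i, s) for s, i in scored[:k]]


# Hypothetical utterances as sequences of discrete acoustic unit IDs.
utterances = [
    [3, 7, 7, 2, 9],
    [1, 4, 4, 6],
    [7, 2, 9, 5],
]
vectors, idf = tfidf_vectors(utterances)
results = search([7, 2, 9], vectors, idf, k=2)
print(results)
```

Because the vectors are sparse and unit-length, each comparison touches only the tokens the query contains. The linear scan still compares the query against every utterance, which is the cost an approximate index such as HNSW is meant to avoid.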

Similar Items