| Main Authors: | Singh, Akanksha; Chen, Yi-Ping Phoebe; Arora, Vipul |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | Audio and Speech Processing |
| Online Access: | https://arxiv.org/abs/2506.16751 |
| _version_ | 1866912442395656192 |
|---|---|
| author | Singh, Akanksha; Chen, Yi-Ping Phoebe; Arora, Vipul |
| author_facet | Singh, Akanksha; Chen, Yi-Ping Phoebe; Arora, Vipul |
| contents | Query-by-example spoken term detection (QbE-STD) searches for matching words or phrases in an audio dataset using a sample spoken query. When annotated data is limited or unavailable, QbE-STD is often done using template matching methods like dynamic time warping (DTW), which are computationally expensive and do not scale well. To address this, we propose H-QuEST (Hierarchical Query-by-Example Spoken Term Detection), a novel framework that accelerates spoken term retrieval by utilizing Term Frequency and Inverse Document Frequency (TF-IDF)-based sparse representations obtained through advanced audio representation learning techniques and Hierarchical Navigable Small World (HNSW) indexing with further refinement. Experimental results show that H-QuEST delivers substantial improvements in retrieval speed without sacrificing accuracy compared to existing methods. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2506_16751 |
| institution | arXiv |
| publishDate | 2025 |
| record_format | arxiv |
| title | H-QuEST: Accelerating Query-by-Example Spoken Term Detection with Hierarchical Indexing |
| topic | Audio and Speech Processing |
| url | https://arxiv.org/abs/2506.16751 |
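
The abstract describes retrieval over TF-IDF-based sparse representations of audio. The sketch below illustrates only that general idea and is not the authors' implementation. It assumes each utterance has already been converted to a sequence of discrete token IDs by some audio representation model (the token values, function names, and smoothed IDF formula are all hypothetical choices made for illustration). An exhaustive cosine-similarity scan stands in for the HNSW index, and the refinement stage mentioned in the abstract is not shown.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (dict: token -> weight) for token sequences."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # document frequency: count each token once per doc
    # Smoothed IDF (an assumption): keeps weights positive for common tokens.
    idf = {tok: math.log((1 + n) / (1 + c)) + 1.0 for tok, c in df.items()}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (c / len(doc)) * idf[t] for t, c in tf.items()})
    return vectors, idf


def embed_query(query, idf):
    """Weight query tokens with the corpus IDF; unseen tokens are dropped."""
    tf = Counter(query)
    return {t: (c / len(query)) * idf[t] for t, c in tf.items() if t in idf}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def search(query, docs, top_k=2):
    """Exhaustive scan used here as a stand-in for an HNSW index lookup."""
    vectors, idf = tfidf_vectors(docs)
    q = embed_query(query, idf)
    scored = sorted(((cosine(q, v), i) for i, v in enumerate(vectors)), reverse=True)
    return [i for _, i in scored[:top_k]]


# Hypothetical token sequences, one per utterance in the audio collection.
utterances = [
    [3, 3, 7, 1, 9, 4],
    [5, 2, 8, 8, 6, 0],
    [7, 1, 9, 2, 5, 5],
]
query = [7, 1, 9]  # hypothetical tokenized spoken query
ranked = search(query, utterances, top_k=2)
print(ranked)
```

In a larger collection, the exhaustive scan in `search` is the step an approximate nearest-neighbor index such as HNSW would replace, with the shortlisted candidates then passed to a more precise matching step.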