Staff View: The Universal Personalizer: Few-Shot Dysarthric Speech Recognition via Meta-Learning

Bibliographic Details
Main Authors:	Agarwal, Dhruuv; Zhang, Harry; Yu, Yang; Wang, Quan
Format:	Preprint
Published:	2025
Subjects:	Audio and Speech Processing; Sound
Online Access:	https://arxiv.org/abs/2509.15516

_version_	1866915811666427904
author	Agarwal, Dhruuv; Zhang, Harry; Yu, Yang; Wang, Quan
author_facet	Agarwal, Dhruuv; Zhang, Harry; Yu, Yang; Wang, Quan
contents	Personalizing dysarthric ASR is hindered by demanding enrollment collection and per-user training. We propose a hybrid meta-training method for a single model, enabling zero-shot and few-shot on-the-fly personalization via in-context learning (ICL). On Euphonia, it achieves 13.9% Word Error Rate (WER), surpassing speaker-independent baselines (17.5%). On SAP Test-1, our 5.3% WER outperforms the challenge-winning team (5.97%). On Test-2, our 9.49% trails only the winner (8.11%) but without relying on techniques like offline model-merging or custom audio chunking. Curation yields a 40% WER reduction using random same-speaker examples, validating active personalization. While static text curation fails to beat this baseline, oracle similarity reveals substantial headroom, highlighting dynamic acoustic retrieval as the next frontier. Data ablations confirm rapid low-resource speaker adaptation, establishing the model as a practical personalized solution.
format	Preprint
id	arxiv_https___arxiv_org_abs_2509_15516
institution	arXiv
publishDate	2025
record_format	arxiv
title	The Universal Personalizer: Few-Shot Dysarthric Speech Recognition via Meta-Learning
topic	Audio and Speech Processing; Sound
url	https://arxiv.org/abs/2509.15516

Similar Items