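Staff View: Cross-lingual Data Selection Using Clip-level Acoustic Similarity for Enhancing Low-resource Automatic Speech Recognition :: Library Catalog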

Bibliographic Details
Main Authors:	Mitsumori, Shunsuke; Kashiwagi, Sara; Tanaka, Keitaro; Morishima, Shigeo
Format:	Preprint
Published:	2025
Subjects:	Audio and Speech Processing
Online Access:	https://arxiv.org/abs/2506.22194

_version_	1866913915123793920
author	Mitsumori, Shunsuke; Kashiwagi, Sara; Tanaka, Keitaro; Morishima, Shigeo
author_facet	Mitsumori, Shunsuke; Kashiwagi, Sara; Tanaka, Keitaro; Morishima, Shigeo
contents	This paper presents a novel donor data selection method to enhance low-resource automatic speech recognition (ASR). While ASR performs well in high-resource languages, its accuracy declines in low-resource settings due to limited training data. A common solution is to leverage multilingual self-supervised learning (SSL) models with donor languages. However, existing methods rely on language-level similarity, overlooking clip-level variations. To address this limitation, we propose clip-wise acoustic token distribution similarity (CATDS), a fine-grained selection method that identifies acoustically relevant donor clips for better alignment with the target language. Unlike existing clip-level selection methods, our method aligns with the representation of SSL models and offers more challenging yet valuable samples. Experimental results show that CATDS outperforms traditional selection methods and can even utilize donor languages previously considered detrimental.
format	Preprint
id	arxiv_https___arxiv_org_abs_2506_22194
institution	arXiv
publishDate	2025
record_format	arxiv
title	Cross-lingual Data Selection Using Clip-level Acoustic Similarity for Enhancing Low-resource Automatic Speech Recognition
topic	Audio and Speech Processing
url	https://arxiv.org/abs/2506.22194
