Bibliographic Details
Main Authors: Mitsumori, Shunsuke; Kashiwagi, Sara; Tanaka, Keitaro; Morishima, Shigeo
Format: Preprint
Published: 2025
Subjects: Audio and Speech Processing
Online Access: https://arxiv.org/abs/2506.22194
_version_ 1866913915123793920
author Mitsumori, Shunsuke
Kashiwagi, Sara
Tanaka, Keitaro
Morishima, Shigeo
contents This paper presents a novel donor data selection method to enhance low-resource automatic speech recognition (ASR). While ASR performs well in high-resource languages, its accuracy declines in low-resource settings due to limited training data. A common solution is to leverage multilingual self-supervised learning (SSL) models with donor languages. However, existing methods rely on language-level similarity, overlooking clip-level variations. To address this limitation, we propose clip-wise acoustic token distribution similarity (CATDS), a fine-grained selection method that identifies acoustically relevant donor clips for better alignment with the target language. Unlike existing clip-level selection methods, our method aligns with the representation of SSL models and offers more challenging yet valuable samples. Experimental results show that CATDS outperforms traditional selection methods and can even utilize donor languages previously considered detrimental.
format Preprint
id arxiv_https___arxiv_org_abs_2506_22194
institution arXiv
publishDate 2025
record_format arxiv
title Cross-lingual Data Selection Using Clip-level Acoustic Similarity for Enhancing Low-resource Automatic Speech Recognition
topic Audio and Speech Processing
url https://arxiv.org/abs/2506.22194