Bibliographic Details
Main Authors: Liu, Wei; Hou, Jingyong; Yang, Dong; Cao, Muyong; Lee, Tan
Format: Preprint
Published: 2024
Subjects: Audio and Speech Processing; Sound
Online Access: https://arxiv.org/abs/2401.03689
author Liu, Wei
Hou, Jingyong
Yang, Dong
Cao, Muyong
Lee, Tan
contents Toward high-performance multilingual automatic speech recognition (ASR), various types of linguistic information and model designs have each demonstrated their effectiveness independently. These include language identity (LID), phoneme information, language-specific processing modules, and cross-lingual self-supervised speech representations. Leveraging their benefits synergistically in a unified solution is expected to further improve overall system performance. This paper presents a novel hierarchical information path, named LUPET, which sequentially encodes, from shallow to deep layers, multiple aspects of linguistic and acoustic information at diverse granularity scales. The path starts with LID prediction, followed by acoustic unit discovery, phoneme sharing, and finally token recognition routed by a mixture of experts. ASR experiments are carried out on 10 languages in the Common Voice corpus. The results demonstrate the superior performance of LUPET compared to the baseline systems. Most importantly, LUPET effectively mitigates the performance compromise that high-resource languages incur when trained together with low-resource ones in the multilingual setting.
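The stage ordering described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the layer depths in `STAGES`, the random weights, the dense softmax gating, and every function name below are assumptions made only to show how auxiliary stages might be attached at increasing depths, with a mixture-of-experts step at the end.

```python
import math
import random

# Stage order follows the abstract; the layer depths are illustrative assumptions.
STAGES = [
    ("lid_prediction", 3),
    ("acoustic_unit_discovery", 6),
    ("phoneme_sharing", 9),
    ("token_recognition_moe", 12),
]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def linear(vec, weights):
    """Multiply a vector by a weight matrix given as a list of output rows."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def moe_layer(vec, gate_weights, expert_weights):
    """Dense mixture of experts: gate-weighted sum of all expert outputs."""
    gate = softmax(linear(vec, gate_weights))
    outputs = [linear(vec, w) for w in expert_weights]
    mixed = [
        sum(g * out[i] for g, out in zip(gate, outputs))
        for i in range(len(vec))
    ]
    return mixed, gate


def run_path(frame, num_layers=12, num_experts=3, seed=0):
    """Pass one feature frame through a toy encoder, recording stage order."""
    rng = random.Random(seed)
    dim = len(frame)

    def rand_matrix(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

    taps = {depth: name for name, depth in STAGES}
    hidden = list(frame)
    visited = []
    gate = None
    for layer in range(1, num_layers + 1):
        # Stand-in for an encoder layer: random projection plus tanh.
        hidden = [math.tanh(x) for x in linear(hidden, rand_matrix(dim, dim))]
        if layer in taps:
            visited.append(taps[layer])
            if taps[layer] == "token_recognition_moe":
                hidden, gate = moe_layer(
                    hidden,
                    rand_matrix(num_experts, dim),
                    [rand_matrix(dim, dim) for _ in range(num_experts)],
                )
    return hidden, gate, visited


if __name__ == "__main__":
    out, gate, visited = run_path([0.1, -0.2, 0.3, 0.4])
    print(visited)
    print([round(g, 3) for g in gate])
```

In a real system each tap would carry its own training objective; here the taps only record the order in which the stages are reached, from shallow to deep.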
format Preprint
id arxiv_https___arxiv_org_abs_2401_03689
institution arXiv
publishDate 2024
record_format arxiv
title LUPET: Incorporating Hierarchical Information Path into Multilingual ASR
topic Audio and Speech Processing
Sound
url https://arxiv.org/abs/2401.03689