Staff View: LUPET: Incorporating Hierarchical Information Path into Multilingual ASR :: Library Catalog

Bibliographic Details
Main Authors:	Liu, Wei; Hou, Jingyong; Yang, Dong; Cao, Muyong; Lee, Tan
Format:	Preprint
Published:	2024
Subjects:	Audio and Speech Processing; Sound
Online Access:	https://arxiv.org/abs/2401.03689

_version_	1866912180546306048
author	Liu, Wei; Hou, Jingyong; Yang, Dong; Cao, Muyong; Lee, Tan
author_facet	Liu, Wei; Hou, Jingyong; Yang, Dong; Cao, Muyong; Lee, Tan
contents	Toward high-performance multilingual automatic speech recognition (ASR), various types of linguistic information and model design have demonstrated their effectiveness independently. They include language identity (LID), phoneme information, language-specific processing modules, and cross-lingual self-supervised speech representation. It is expected that leveraging their benefits synergistically in a unified solution would further improve the overall system performance. This paper presents a novel design of a hierarchical information path, named LUPET, which sequentially encodes, from the shallow layers to deep layers, multiple aspects of linguistic and acoustic information at diverse granularity scales. The path starts from LID prediction, followed by acoustic unit discovery, phoneme sharing, and finally token recognition routed by a mixture-of-expert. ASR experiments are carried out on 10 languages in the Common Voice corpus. The results demonstrate the superior performance of LUPET as compared to the baseline systems. Most importantly, LUPET effectively mitigates the issue of performance compromise of high-resource languages with low-resource ones in the multilingual setting.
format	Preprint
id	arxiv_https___arxiv_org_abs_2401_03689
institution	arXiv
publishDate	2024
record_format	arxiv
title	LUPET: Incorporating Hierarchical Information Path into Multilingual ASR
topic	Audio and Speech Processing; Sound
url	https://arxiv.org/abs/2401.03689

Similar Items