MARC21: :: Library Catalog

Bibliographic Details
Main Authors:	Meng, Qingliang; Ren, Pengju; Li, Tian; Dai, Changsong; Liang, Huizhi
Format:	Preprint
Published:	2025
Subjects:	Computation and Language; Audio and Speech Processing
Online Access:	https://arxiv.org/abs/2502.10058

_version_	1866909648870703104
author	Meng, Qingliang; Ren, Pengju; Li, Tian; Dai, Changsong; Liang, Huizhi
contents	Automatic speech recognition (ASR) systems normally consist of an acoustic model (AM) and a language model (LM). The acoustic model estimates the probability distribution of text given the input speech, while the language model calibrates this distribution toward a specific knowledge domain to produce the final transcription. Traditional ASR-specific LMs are typically trained in a unidirectional (left-to-right) manner to align with autoregressive decoding. However, this prevents the model from leveraging right-side context during training, which limits its representational capacity. In this work, we propose MTLM, a novel training paradigm that unifies unidirectional and bidirectional training through three objectives: ULM, BMLM, and UMLM. This approach enhances the LM's ability to capture richer linguistic patterns from both left and right contexts while preserving compatibility with standard autoregressive ASR decoding methods. As a result, the MTLM model not only improves the ASR system's performance but also supports multiple decoding strategies, including shallow fusion and unidirectional/bidirectional n-best rescoring. Experiments on the LibriSpeech dataset show that MTLM consistently outperforms unidirectional training across multiple decoding strategies, highlighting its effectiveness and flexibility in ASR applications.
format	Preprint
id	arxiv_https___arxiv_org_abs_2502_10058
institution	arXiv
publishDate	2025
record_format	arxiv
title	MTLM: Incorporating Bidirectional Text Information to Enhance Language Model Training in Speech Recognition Systems
topic	Computation and Language; Audio and Speech Processing
url	https://arxiv.org/abs/2502.10058
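
The abstract in this record mentions n-best rescoring as one of the decoding strategies in which a language model is combined with an acoustic model. As a rough illustration only, the generic idea can be sketched as below. This is not the MTLM method or the authors' code: the function names, the toy unigram LM, the example hypotheses, and the interpolation weight `lm_weight` are all hypothetical, and a real system would use a trained neural LM in place of the word-count table.

```python
import math
from typing import Callable, Dict, List, Tuple

# A hypothesis is (text, acoustic-model log-probability). Both values here are
# made-up illustration data, not results from the paper.
Hypothesis = Tuple[str, float]


def rescore_nbest(
    nbest: List[Hypothesis],
    lm_log_prob: Callable[[str], float],
    lm_weight: float = 0.5,
) -> List[Hypothesis]:
    """Re-rank hypotheses by AM score plus a weighted LM score (highest first)."""
    rescored = [(text, am + lm_weight * lm_log_prob(text)) for text, am in nbest]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


def make_toy_unigram_lm(
    counts: Dict[str, float], unk_count: float = 0.5
) -> Callable[[str], float]:
    """Build a toy unigram LM (hypothetical stand-in for a trained LM)."""
    total = sum(counts.values()) + unk_count

    def log_prob(text: str) -> float:
        # Sum of per-word log-probabilities; unseen words get a small count.
        return sum(math.log(counts.get(w, unk_count) / total) for w in text.split())

    return log_prob


toy_lm = make_toy_unigram_lm(
    {"recognize": 5, "speech": 5, "wreck": 1, "a": 3, "nice": 2, "beach": 1}
)
nbest = [("wreck a nice beach", -4.0), ("recognize speech", -4.2)]

best_text, best_score = rescore_nbest(nbest, toy_lm, lm_weight=0.5)[0]
print(best_text)
```

With `lm_weight=0.0` the ranking falls back to the acoustic scores alone, so the first hypothesis stays on top; with a positive weight the toy LM prefers the shorter, more frequent word sequence.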

Similar Items