Bibliographic Details
Main authors: Meng, Qingliang; Ren, Pengju; Li, Tian; Dai, Changsong; Liang, Huizhi
Type: Preprint
Published: 2025
Subjects: Computation and Language; Audio and Speech Processing
Online access: https://arxiv.org/abs/2502.10058
author Meng, Qingliang
Ren, Pengju
Li, Tian
Dai, Changsong
Liang, Huizhi
contents Automatic speech recognition (ASR) systems normally consist of an acoustic model (AM) and a language model (LM). The acoustic model estimates the probability distribution of text given the input speech, while the language model calibrates this distribution toward a specific knowledge domain to produce the final transcription. Traditional ASR-specific LMs are typically trained in a unidirectional (left-to-right) manner to align with autoregressive decoding. However, this prevents the model from leveraging right-side context during training, which limits its representational capacity. In this work, we propose MTLM, a novel training paradigm that unifies unidirectional and bidirectional training through three objectives: ULM, BMLM, and UMLM. This approach enhances the LM's ability to capture richer linguistic patterns from both left and right contexts while preserving compatibility with standard ASR autoregressive decoding methods. As a result, the MTLM model not only enhances the ASR system's performance but also supports multiple decoding strategies, including shallow fusion and unidirectional/bidirectional n-best rescoring. Experiments on the LibriSpeech dataset show that MTLM consistently outperforms unidirectional training across multiple decoding strategies, highlighting its effectiveness and flexibility in ASR applications.
format Preprint
id arxiv_https___arxiv_org_abs_2502_10058
institution arXiv
publishDate 2025
record_format arxiv
title MTLM: Incorporating Bidirectional Text Information to Enhance Language Model Training in Speech Recognition Systems
topic Computation and Language
Audio and Speech Processing
url https://arxiv.org/abs/2502.10058
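
The n-best rescoring mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the common log-linear combination of an acoustic score and a weighted LM score, and the names `rescore_nbest`, `toy_lm_logprob`, `TOY_UNIGRAM`, and the weight `lm_weight` are hypothetical. A toy unigram model stands in for the language model only to keep the example self-contained.

```python
import math
from typing import Callable, List, Tuple

Hypothesis = Tuple[str, float]  # (text, acoustic log-score)


def rescore_nbest(
    nbest: List[Hypothesis],
    lm_logprob: Callable[[str], float],
    lm_weight: float = 0.5,
) -> List[Hypothesis]:
    """Re-rank hypotheses by acoustic score + lm_weight * LM log-probability."""
    rescored = [(text, am + lm_weight * lm_logprob(text)) for text, am in nbest]
    # Highest combined score first.
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


# Hypothetical stand-in LM: a tiny unigram table with a floor for unseen words.
TOY_UNIGRAM = {"the": 0.4, "cat": 0.3, "sat": 0.2, "cap": 0.05}


def toy_lm_logprob(text: str) -> float:
    """Sum of unigram log-probabilities; unseen words get probability 0.01."""
    return sum(math.log(TOY_UNIGRAM.get(word, 0.01)) for word in text.split())


# The acoustic model alone slightly prefers the wrong hypothesis.
nbest = [("the cap sat", -1.0), ("the cat sat", -1.2)]
best_text, best_score = rescore_nbest(nbest, toy_lm_logprob, lm_weight=0.5)[0]
print(best_text)
```

With `lm_weight=0.0` the ranking falls back to the acoustic scores alone. In a real system the toy scorer would be replaced by the trained LM, and a bidirectional scorer could be swapped in through the same `lm_logprob` argument because rescoring sees each complete hypothesis.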