Bibliographic Details
Main Authors: Guo, Jinxi; Moritz, Niko; Ma, Yingyi; Seide, Frank; Wu, Chunyang; Mahadeokar, Jay; Kalinli, Ozlem; Fuegen, Christian; Seltzer, Mike
Format: Preprint
Published: 2024
Subjects: Audio and Speech Processing; Artificial Intelligence; Computation and Language; Machine Learning
Online Access: https://arxiv.org/abs/2404.01716
_version_ 1866911823967551488
author Guo, Jinxi
Moritz, Niko
Ma, Yingyi
Seide, Frank
Wu, Chunyang
Mahadeokar, Jay
Kalinli, Ozlem
Fuegen, Christian
Seltzer, Mike
contents The internal language model (ILM) of the neural transducer has been widely studied. In most prior work, it is mainly used for estimating the ILM score, which is subsequently subtracted during inference to facilitate improved integration with external language models. Recently, various factorized transducer models have been proposed, which explicitly adopt a standalone internal language model for non-blank token prediction. However, even with the adoption of factorized transducer models, limited improvement has been observed compared to shallow fusion. In this paper, we propose a novel ILM training and decoding strategy for factorized transducer models, which effectively combines the blank, acoustic and ILM scores. Our experiments show a 17% relative improvement over the standard decoding method when utilizing a well-trained ILM and the proposed decoding strategy on LibriSpeech datasets. Furthermore, when compared to a strong RNN-T baseline enhanced with external LM fusion, the proposed model yields a 5.5% relative improvement on general sets and an 8.9% WER reduction for rare words. The proposed model can achieve superior performance without relying on external language models, rendering it highly efficient for production use cases. To further improve the performance, we propose a novel and memory-efficient ILM-fusion-aware minimum word error rate (MWER) training method which improves ILM integration significantly.
format Preprint
id arxiv_https___arxiv_org_abs_2404_01716
institution arXiv
publishDate 2024
record_format arxiv
title Effective internal language model training and fusion for factorized transducer model
topic Audio and Speech Processing
Artificial Intelligence
Computation and Language
Machine Learning
url https://arxiv.org/abs/2404.01716