Staff View: :: Library Catalog

Bibliographic Details
Main Authors:	Raissi, Tina; Schlüter, Ralf; Ney, Hermann
Format:	Preprint
Published:	2025
Subjects:	Sound Audio and Speech Processing
Online Access:	https://arxiv.org/abs/2501.04521

_version_	1866909452123242496
author	Raissi, Tina; Schlüter, Ralf; Ney, Hermann
author_facet	Raissi, Tina; Schlüter, Ralf; Ney, Hermann
contents	Current time-synchronous sequence-to-sequence automatic speech recognition (ASR) models are trained using a sequence-level cross-entropy that sums over all alignments. Due to the discriminative formulation, incorporating the right label context into the training criterion's gradient causes normalization problems and is not mathematically well-defined. The classic hybrid neural network hidden Markov model (NN-HMM), with its inherent generative formulation, enables conditioning on the right label context. However, due to HMM state tying, the identity of the right label context is never modeled explicitly. In this work, we propose a factored loss with auxiliary left and right label contexts that sums over all alignments. We show that the inclusion of the right label context is particularly beneficial when training data resources are limited. Moreover, we also show that it is possible to build a factored hybrid HMM system by relying exclusively on the full-sum criterion. Experiments were conducted on Switchboard 300h and LibriSpeech 960h.
format	Preprint
id	arxiv_https___arxiv_org_abs_2501_04521
institution	arXiv
publishDate	2025
record_format	arxiv
title	Right Label Context in End-to-End Training of Time-Synchronous ASR Models
topic	Sound Audio and Speech Processing
url	https://arxiv.org/abs/2501.04521

Similar Items