Staff View: Lightweight Transducer Based on Frame-Level Criterion :: Library Catalog

Bibliographic Details
Main Authors:	Wan, Genshun; Wang, Mengzhi; Mao, Tingzhi; Chen, Hang; Ye, Zhongfu
Format:	Preprint
Published:	2024
Subjects:	Computation and Language; Sound; Audio and Speech Processing
Online Access:	https://arxiv.org/abs/2409.13698

_version_	1866912099370795008
author	Wan, Genshun; Wang, Mengzhi; Mao, Tingzhi; Chen, Hang; Ye, Zhongfu
author_facet	Wan, Genshun; Wang, Mengzhi; Mao, Tingzhi; Chen, Hang; Ye, Zhongfu
contents	Transducer models trained with a sequence-level criterion require a large amount of memory because they generate a large probability matrix. We propose a lightweight transducer model based on a frame-level criterion, which uses the results of the CTC forced alignment algorithm to determine the label for each frame. The encoder output can then be combined with the decoder output at the corresponding time, rather than adding each element output by the encoder to each element output by the decoder as in the standard transducer. This significantly reduces memory and computation requirements. To address the imbalanced classification caused by excessive blanks in the labels, we decouple the blank and non-blank probabilities and truncate the gradient of the blank classifier to the main network. Experiments on AISHELL-1 demonstrate that this enables the lightweight transducer to achieve results similar to those of the transducer. Additionally, we use richer information to predict the blank probability, achieving results superior to those of the transducer.
format	Preprint
id	arxiv_https___arxiv_org_abs_2409_13698
institution	arXiv
publishDate	2024
record_format	arxiv
title	Lightweight Transducer Based on Frame-Level Criterion
topic	Computation and Language; Sound; Audio and Speech Processing
url	https://arxiv.org/abs/2409.13698
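
The frame-level combination described in the contents field can be illustrated with a small sketch. It is not the authors' implementation: the function names, the toy vectors, and the use of blank id 0 are all hypothetical, and it assumes the forced alignment has already been reduced to one label per frame, with each emitted token occupying exactly one non-blank frame. Under those assumptions, each encoder frame is added to a single decoder state (the one for the tokens emitted so far), instead of being added to every decoder state as in a standard transducer joint.

```python
def decoder_index_per_frame(frame_labels, blank=0):
    """For each frame, count the non-blank labels emitted before it.

    Assumption: every non-blank frame label is exactly one emitted token.
    """
    indices = []
    emitted = 0
    for label in frame_labels:
        indices.append(emitted)
        if label != blank:
            emitted += 1
    return indices


def add_vectors(a, b):
    """Element-wise sum of two equal-length vectors."""
    return [x + y for x, y in zip(a, b)]


def full_joint(enc, dec):
    """Transducer-style joint: every encoder frame paired with every decoder state."""
    return [[add_vectors(e, d) for d in dec] for e in enc]


def frame_level_joint(enc, dec, frame_labels, blank=0):
    """Frame-level joint: each encoder frame paired with one aligned decoder state."""
    idx = decoder_index_per_frame(frame_labels, blank)
    return [add_vectors(enc[t], dec[u]) for t, u in enumerate(idx)]


# Toy data (hypothetical): T = 6 frames, 2 tokens, hidden size 2.
enc = [[float(t), 1.0] for t in range(6)]
dec = [[10.0 * u, 0.0] for u in range(3)]  # start state plus one state per token
frame_labels = [0, 7, 0, 0, 9, 0]          # 0 is blank; 7 and 9 are token ids

grid = full_joint(enc, dec)
frames = frame_level_joint(enc, dec, frame_labels)
full_count = sum(len(row) for row in grid)  # number of joint vectors, T * (U + 1)
frame_count = len(frames)                   # number of joint vectors, T
```

In this toy case the full joint builds 18 combined vectors while the frame-level version builds 6, which is the source of the memory and computation savings the abstract refers to. The decoupled blank classifier and the gradient truncation are not covered by this sketch.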

Similar Items