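Staff View: CUSIDE-array: A Streaming Multi-Channel End-to-End Speech Recognition System with Realistic Evaluations :: Library Catalog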

Bibliographic Details
Main Authors:	Kong, Xiangzhu; Ning, Tianqi; Huang, Hao; Ou, Zhijian
Format:	Preprint
Published:	2024
Subjects:	Audio and Speech Processing
Online Access:	https://arxiv.org/abs/2407.09807

_version_	1866913503010357248
author	Kong, Xiangzhu; Ning, Tianqi; Huang, Hao; Ou, Zhijian
author_facet	Kong, Xiangzhu; Ning, Tianqi; Huang, Hao; Ou, Zhijian
contents	Multi-channel end-to-end (ME2E) ASR systems have recently emerged. While streaming single-channel end-to-end ASR has been studied extensively, streaming ME2E ASR has seen limited exploration. In addition, recent studies call attention to the gap between in-distribution (ID) and out-of-distribution (OOD) test results and to the need for realistic evaluations. This paper addresses two research problems: realizing streaming ME2E ASR and improving OOD generalization. We propose the CUSIDE-array method, which integrates the recent CUSIDE methodology (Chunking, Simulating Future Context and Decoding) into the neural beamformer approach to ME2E ASR. It enables streaming processing in both the front-end and the back-end with a total latency of 402 ms. The CUSIDE-array ME2E models are shown to achieve superior streaming results in both ID and OOD tests. Realistic evaluations confirm an advantage of CUSIDE-array: it can consume single-channel data to improve OOD generalization via back-end pre-training and ME2E fine-tuning.
format	Preprint
id	arxiv_https___arxiv_org_abs_2407_09807
institution	arXiv
publishDate	2024
record_format	arxiv
title	CUSIDE-array: A Streaming Multi-Channel End-to-End Speech Recognition System with Realistic Evaluations
topic	Audio and Speech Processing
url	https://arxiv.org/abs/2407.09807

Similar Items