| Main Authors: | Kong, Xiangzhu; Ning, Tianqi; Huang, Hao; Ou, Zhijian |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | Audio and Speech Processing |
| Online Access: | https://arxiv.org/abs/2407.09807 |
| _version_ | 1866913503010357248 |
| author | Kong, Xiangzhu; Ning, Tianqi; Huang, Hao; Ou, Zhijian |
| author_facet | Kong, Xiangzhu; Ning, Tianqi; Huang, Hao; Ou, Zhijian |
| contents | Multi-channel end-to-end (ME2E) ASR systems have recently emerged. While streaming single-channel end-to-end ASR has been extensively studied, streaming ME2E ASR has received limited exploration. Additionally, recent studies call attention to the gap between in-distribution (ID) and out-of-distribution (OOD) tests and to the need for realistic evaluations. This paper focuses on two research problems: realizing streaming ME2E ASR and improving OOD generalization. We propose the CUSIDE-array method, which integrates the recent CUSIDE methodology (Chunking, Simulating Future Context and Decoding) into the neural beamformer approach of ME2E ASR. It enables streaming processing of both the front-end and the back-end with a total latency of 402 ms. The CUSIDE-array ME2E models are shown to achieve superior streaming results in both ID and OOD tests. Realistic evaluations confirm the advantage of CUSIDE-array: it can consume single-channel data to improve OOD generalization via back-end pre-training and ME2E fine-tuning. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2407_09807 |
| institution | arXiv |
| publishDate | 2024 |
| record_format | arxiv |
| title | CUSIDE-array: A Streaming Multi-Channel End-to-End Speech Recognition System with Realistic Evaluations |
| topic | Audio and Speech Processing |
| url | https://arxiv.org/abs/2407.09807 |