Staff View: Advancing Multi-talker ASR Performance with Large Language Models :: Library Catalog

Bibliographic Details
Main Authors:	Shi, Mohan; Jin, Zengrui; Xu, Yaoxun; Xu, Yong; Zhang, Shi-Xiong; Wei, Kun; Shao, Yiwen; Zhang, Chunlei; Yu, Dong
Format:	Preprint
Published:	2024
Subjects:	Audio and Speech Processing; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2408.17431

_version_	1866929479722467328
author	Shi, Mohan; Jin, Zengrui; Xu, Yaoxun; Xu, Yong; Zhang, Shi-Xiong; Wei, Kun; Shao, Yiwen; Zhang, Chunlei; Yu, Dong
author_facet	Shi, Mohan; Jin, Zengrui; Xu, Yaoxun; Xu, Yong; Zhang, Shi-Xiong; Wei, Kun; Shao, Yiwen; Zhang, Chunlei; Yu, Dong
contents	Recognizing overlapping speech from multiple speakers in conversational scenarios is one of the most challenging problems for automatic speech recognition (ASR). Serialized output training (SOT) is a classic method to address multi-talker ASR, with the idea of concatenating transcriptions from multiple speakers according to the emission times of their speech for training. However, SOT-style transcriptions, derived from concatenating multiple related utterances in a conversation, depend significantly on modeling long contexts. Therefore, compared to traditional methods that primarily emphasize encoder performance in attention-based encoder-decoder (AED) architectures, a novel approach utilizing large language models (LLMs) that leverages the capabilities of pre-trained decoders may be better suited for such complex and challenging scenarios. In this paper, we propose an LLM-based SOT approach for multi-talker ASR, leveraging a pre-trained speech encoder and LLM and fine-tuning them on multi-talker datasets using appropriate strategies. Experimental results demonstrate that our approach surpasses traditional AED-based methods on the simulated dataset LibriMix and achieves state-of-the-art performance on the evaluation set of the real-world dataset AMI, outperforming the AED model trained with 1000 times more supervised data in previous works.
format	Preprint
id	arxiv_https___arxiv_org_abs_2408_17431
institution	arXiv
publishDate	2024
record_format	arxiv
title	Advancing Multi-talker ASR Performance with Large Language Models
topic	Audio and Speech Processing; Artificial Intelligence
url	https://arxiv.org/abs/2408.17431

Similar Items