Bibliographic Details
Main Authors: Shi, Mohan, Jin, Zengrui, Xu, Yaoxun, Xu, Yong, Zhang, Shi-Xiong, Wei, Kun, Shao, Yiwen, Zhang, Chunlei, Yu, Dong
Format: Preprint
Published: 2024
Subjects: Audio and Speech Processing, Artificial Intelligence
Online Access: https://arxiv.org/abs/2408.17431
_version_ 1866929479722467328
author Shi, Mohan
Jin, Zengrui
Xu, Yaoxun
Xu, Yong
Zhang, Shi-Xiong
Wei, Kun
Shao, Yiwen
Zhang, Chunlei
Yu, Dong
author_facet Shi, Mohan
Jin, Zengrui
Xu, Yaoxun
Xu, Yong
Zhang, Shi-Xiong
Wei, Kun
Shao, Yiwen
Zhang, Chunlei
Yu, Dong
contents Recognizing overlapping speech from multiple speakers in conversational scenarios is one of the most challenging problems for automatic speech recognition (ASR). Serialized output training (SOT) is a classic method for multi-talker ASR: the transcriptions of multiple speakers are concatenated according to the emission times of their speech and used as the training target. However, SOT-style transcriptions, derived from concatenating multiple related utterances in a conversation, depend significantly on modeling long contexts. Therefore, compared to traditional methods that primarily emphasize encoder performance in attention-based encoder-decoder (AED) architectures, an approach that uses large language models (LLMs) and leverages the capabilities of pre-trained decoders may be better suited for such complex and challenging scenarios. In this paper, we propose an LLM-based SOT approach for multi-talker ASR that leverages a pre-trained speech encoder and an LLM, fine-tuning them on multi-talker datasets using appropriate strategies. Experimental results demonstrate that our approach surpasses traditional AED-based methods on the simulated dataset LibriMix and achieves state-of-the-art performance on the evaluation set of the real-world dataset AMI, outperforming an AED model from previous work that was trained with 1000 times more supervised data.
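The serialized output training idea in the abstract can be illustrated with a minimal sketch: utterances from different speakers are ordered by emission time and their transcriptions are joined into one target string. The `Utterance` type, the `<sc>` separator token, the use of start time as the emission time, and the function name `build_sot_target` are all assumptions made for illustration; they are not taken from the paper or its code.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    speaker: str
    start: float  # start time in seconds; assumed here to stand in for emission time
    text: str


# Hypothetical separator marking a change of speaker in the serialized target.
SPEAKER_CHANGE = "<sc>"


def build_sot_target(utterances, sep=SPEAKER_CHANGE):
    """Concatenate transcriptions in order of (assumed) emission time."""
    ordered = sorted(utterances, key=lambda u: u.start)
    return f" {sep} ".join(u.text for u in ordered)


# Two overlapping utterances, listed out of order on purpose.
example = [
    Utterance("B", 1.2, "i am fine thanks"),
    Utterance("A", 0.0, "hello how are you"),
]
target = build_sot_target(example)
```

Because the target is a single long string spanning several related utterances, a decoder with strong long-context language modeling is a plausible fit, which is the motivation the abstract gives for using an LLM.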
format Preprint
id arxiv_https___arxiv_org_abs_2408_17431
institution arXiv
publishDate 2024
record_format arxiv
title Advancing Multi-talker ASR Performance with Large Language Models
topic Audio and Speech Processing
Artificial Intelligence
url https://arxiv.org/abs/2408.17431