Staff View: Serialized Output Training by Learned Dominance :: Library Catalog

Bibliographic Details
Main Authors:	Shi, Ying; Li, Lantian; Yin, Shi; Wang, Dong; Han, Jiqing
Format:	Preprint
Published:	2024
Subjects:	Sound; Artificial Intelligence; Audio and Speech Processing
Online Access:	https://arxiv.org/abs/2407.03966

_version_	1866929410296250368
author	Shi, Ying; Li, Lantian; Yin, Shi; Wang, Dong; Han, Jiqing
author_facet	Shi, Ying; Li, Lantian; Yin, Shi; Wang, Dong; Han, Jiqing
contents	Serialized Output Training (SOT) has showcased state-of-the-art performance in multi-talker speech recognition by sequentially decoding the speech of individual speakers. To address the challenging label-permutation issue, prior methods have relied on either the Permutation Invariant Training (PIT) or the time-based First-In-First-Out (FIFO) rule. This study presents a model-based serialization strategy that incorporates an auxiliary module into the Attention Encoder-Decoder architecture, autonomously identifying the crucial factors to order the output sequence of the speech components in multi-talker speech. Experiments conducted on the LibriSpeech and LibriMix databases reveal that our approach significantly outperforms the PIT and FIFO baselines in both 2-mix and 3-mix scenarios. Further analysis shows that the serialization module identifies dominant speech components in a mixture by factors including loudness and gender, and orders speech components based on the dominance score.
format	Preprint
id	arxiv_https___arxiv_org_abs_2407_03966
institution	arXiv
publishDate	2024
record_format	arxiv
title	Serialized Output Training by Learned Dominance
topic	Sound; Artificial Intelligence; Audio and Speech Processing
url	https://arxiv.org/abs/2407.03966
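
The abstract in the contents field contrasts the time-based FIFO rule with ordering by a dominance score. The sketch below is only an illustration of that contrast, not the authors' implementation: the `Component` class, its `dominance` field, and the choice to place the highest score first are all assumptions made for the example. In the preprint the score comes from a learned serialization module, which is not modeled here. The `<sc>` speaker-change token follows the usual SOT convention.

```python
from dataclasses import dataclass
from typing import Callable, List

SC_TOKEN = "<sc>"  # speaker-change separator conventionally used in SOT targets


@dataclass
class Component:
    """One speaker's contribution to a mixture (hypothetical structure)."""
    text: str
    start_time: float  # seconds; the quantity the FIFO rule sorts on
    dominance: float   # assumed score; stands in for a learned module's output


def serialize(components: List[Component],
              key: Callable[[Component], float],
              reverse: bool = False) -> str:
    """Join transcripts into one SOT target string in the order given by `key`."""
    ordered = sorted(components, key=key, reverse=reverse)
    return f" {SC_TOKEN} ".join(c.text for c in ordered)


def fifo_target(components: List[Component]) -> str:
    # FIFO rule: the earliest-starting speaker is transcribed first.
    return serialize(components, key=lambda c: c.start_time)


def dominance_target(components: List[Component]) -> str:
    # Assumption: the most dominant component is transcribed first.
    return serialize(components, key=lambda c: c.dominance, reverse=True)


mix = [
    Component("hello world", start_time=0.0, dominance=0.2),
    Component("good morning", start_time=0.5, dominance=0.9),
]
```

With this toy mixture, `fifo_target(mix)` puts "hello world" first because it starts earlier, while `dominance_target(mix)` puts "good morning" first because its assumed score is higher.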

Similar Items