Bibliographic Details
Main Authors: Shi, Ying; Li, Lantian; Yin, Shi; Wang, Dong; Han, Jiqing
Format: Preprint
Published: 2024
Subjects: Sound; Artificial Intelligence; Audio and Speech Processing
Online Access: https://arxiv.org/abs/2407.03966
_version_ 1866929410296250368
author Shi, Ying
Li, Lantian
Yin, Shi
Wang, Dong
Han, Jiqing
contents Serialized Output Training (SOT) has showcased state-of-the-art performance in multi-talker speech recognition by sequentially decoding the speech of individual speakers. To address the challenging label-permutation issue, prior methods have relied on either the Permutation Invariant Training (PIT) or the time-based First-In-First-Out (FIFO) rule. This study presents a model-based serialization strategy that incorporates an auxiliary module into the Attention Encoder-Decoder architecture, autonomously identifying the crucial factors to order the output sequence of the speech components in multi-talker speech. Experiments conducted on the LibriSpeech and LibriMix databases reveal that our approach significantly outperforms the PIT and FIFO baselines in both 2-mix and 3-mix scenarios. Further analysis shows that the serialization module identifies dominant speech components in a mixture by factors including loudness and gender, and orders speech components based on the dominance score.
format Preprint
id arxiv_https___arxiv_org_abs_2407_03966
institution arXiv
publishDate 2024
record_format arxiv
title Serialized Output Training by Learned Dominance
topic Sound
Artificial Intelligence
Audio and Speech Processing
url https://arxiv.org/abs/2407.03966