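Staff View: Enroll-on-Wakeup: A First Comparative Study of Target Speech Extraction for Seamless Interaction in Real Noisy Human-Machine Dialogue Scenarios :: Library Catalog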

Bibliographic Details
Main Authors:	Yang, Yiming; Wang, Guangyong; Guan, Haixin; Long, Yanhua
Format:	Preprint
Published:	2026
Subjects:	Audio and Speech Processing; Sound
Online Access:	https://arxiv.org/abs/2602.15519

_version_	1866911464064811008
author	Yang, Yiming; Wang, Guangyong; Guan, Haixin; Long, Yanhua
author_facet	Yang, Yiming; Wang, Guangyong; Guan, Haixin; Long, Yanhua
contents	Target speech extraction (TSE) typically relies on pre-recorded high-quality enrollment speech, which disrupts user experience and limits feasibility in spontaneous interaction. In this paper, we propose Enroll-on-Wakeup (EoW), a novel framework where the wake-word segment, captured naturally during human-machine interaction, is automatically utilized as the enrollment reference. This eliminates the need for pre-collected speech to enable a seamless experience. We perform the first systematic study of EoW-TSE, evaluating advanced discriminative and generative models under real diverse acoustic conditions. Given the short and noisy nature of wake-word segments, we investigate enrollment augmentation using LLM-based TTS. Results show that while current TSE models face performance degradation in EoW-TSE, TTS-based assistance significantly enhances the listening experience, though gaps remain in speech recognition accuracy.
format	Preprint
id	arxiv_https___arxiv_org_abs_2602_15519
institution	arXiv
publishDate	2026
record_format	arxiv
title	Enroll-on-Wakeup: A First Comparative Study of Target Speech Extraction for Seamless Interaction in Real Noisy Human-Machine Dialogue Scenarios
topic	Audio and Speech Processing; Sound
url	https://arxiv.org/abs/2602.15519

Similar Items