Staff View: LauraTSE: Target Speaker Extraction using Auto-Regressive Decoder-Only Language Models :: Library Catalog

Bibliographic Details
Main Authors:	Tang, Beilong; Zeng, Bang; Li, Ming
Format:	Preprint
Published:	2025
Subjects:	Machine Learning; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2504.07402

_version_	1866915447837818880
author	Tang, Beilong; Zeng, Bang; Li, Ming
author_facet	Tang, Beilong; Zeng, Bang; Li, Ming
contents	We propose LauraTSE, an Auto-Regressive Decoder-Only Language Model for Target Speaker Extraction built upon the LauraGPT backbone. LauraTSE employs a small-scale auto-regressive decoder-only language model that generates the initial layers of the target speech's discrete codec representations from the continuous embeddings of both the mixture and reference speech. These outputs serve as coarse-grained predictions. To refine them, a one-step encoder-only language model reconstructs the full codec representation by integrating information from both the mixture and the reference speech, adding fine-grained details. Experimental results show that our approach can achieve promising performance. Additionally, we conduct ablation studies to investigate the data scalability and the contribution of the encoder-only model.
format	Preprint
id	arxiv_https___arxiv_org_abs_2504_07402
institution	arXiv
publishDate	2025
record_format	arxiv
title	LauraTSE: Target Speaker Extraction using Auto-Regressive Decoder-Only Language Models
topic	Machine Learning; Artificial Intelligence
url	https://arxiv.org/abs/2504.07402

Similar Items