Bibliographic Details
Main Authors:	Li, Haoyu; Han, Mingyang; Xi, Yu; Wang, Dongxiao; Wang, Hankun; Shi, Haoxiang; Li, Boyu; Song, Jun; Zheng, Bo; Wang, Shuai; Yu, Kai
Format:	Preprint
Published:	2025
Subjects:	Audio and Speech Processing
Online Access:	https://arxiv.org/abs/2511.09995

Abstract:

Flow-Matching (FM)-based zero-shot text-to-speech (TTS) systems exhibit high-quality speech synthesis and robust generalization capabilities. However, the speaker representation ability of such systems remains underexplored, primarily due to the lack of explicit speaker-specific supervision in the FM framework. To this end, we conduct an empirical analysis of speaker information distribution and reveal its non-uniform allocation across time steps and network layers, underscoring the need for adaptive speaker alignment. Accordingly, we propose Time-Layer Adaptive Speaker Alignment (TLA-SA), a strategy that enhances speaker consistency by jointly leveraging temporal and hierarchical variations. Experimental results show that TLA-SA substantially improves speaker similarity over baseline systems on both research- and industrial-scale datasets and generalizes well across diverse model architectures, including decoder-only language model (LM)-based and LM-free TTS systems. A demo is provided.

Similar Items