Bibliographic Details
Main Authors: Li, Haoyu, Han, Mingyang, Xi, Yu, Wang, Dongxiao, Wang, Hankun, Shi, Haoxiang, Li, Boyu, Song, Jun, Zheng, Bo, Wang, Shuai, Yu, Kai
Format: Preprint
Published: 2025
Online Access: https://arxiv.org/abs/2511.09995
Table of Contents:
  • Flow-Matching (FM)-based zero-shot text-to-speech (TTS) systems exhibit high-quality speech synthesis and robust generalization capabilities. However, the speaker representation ability of such systems remains underexplored, primarily due to the lack of explicit speaker-specific supervision in the FM framework. To this end, we conduct an empirical analysis of how speaker information is distributed and reveal that it is allocated non-uniformly across time steps and network layers, underscoring the need for adaptive speaker alignment. Accordingly, we propose Time-Layer Adaptive Speaker Alignment (TLA-SA), a strategy that enhances speaker consistency by jointly leveraging temporal and hierarchical variations. Experimental results show that TLA-SA substantially improves speaker similarity over baseline systems on both research- and industrial-scale datasets and generalizes well across diverse model architectures, including decoder-only language model (LM)-based and LM-free TTS systems. A demo is provided.