Bibliographic Details
Main Authors:	Yang, Tianle, Sun, Chengzhe, Rose, Phil, Jacobs, Cassandra L., Lyu, Siwei
Format:	Preprint
Published:	2026
Subjects:	Computation and Language; Artificial Intelligence; Sound
Online Access:	https://arxiv.org/abs/2603.21078

Abstract:

This study proposes a segmental-level prosodic probing framework to evaluate the ability of neural text-to-speech (TTS) models to reproduce consonant-induced fundamental frequency (f0) perturbation, a fine-grained segmental-prosodic effect that reflects local articulatory mechanisms. We compare synthetic and natural speech realizations for thousands of words, stratified by lexical frequency, using Tacotron 2 and FastSpeech 2 trained on the same speech corpus (LJ Speech). These controlled analyses are then complemented by a large-scale evaluation spanning multiple advanced TTS systems. Results show accurate reproduction for high-frequency words but poor generalization to low-frequency items, suggesting that the examined TTS architectures rely more on lexical-level memorization than on abstract segmental-prosodic encoding. This finding highlights a limitation in the ability of such TTS systems to generalize prosodic detail beyond seen data. The proposed probe offers a linguistically informed diagnostic framework that may inform future TTS evaluation methods and has implications for interpretability and authenticity assessment in synthetic speech.
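
The comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' method: the abstract does not say how f0 perturbation is measured, so the sketch assumes one common operationalization, namely f0 at vowel onset expressed in semitones relative to f0 at the vowel midpoint, and then takes the voiceless-minus-voiced difference within each frequency band for natural and synthetic speech separately. The function names (`perturbation_st`, `voicing_effect`) and the token fields (`source`, `band`, `onset`, `f0_onset`, `f0_mid`) are hypothetical.

```python
import math
from collections import defaultdict
from statistics import mean


def perturbation_st(f0_onset_hz, f0_mid_hz):
    """Vowel-onset f0 relative to vowel-midpoint f0, in semitones (assumed measure)."""
    return 12.0 * math.log2(f0_onset_hz / f0_mid_hz)


def voicing_effect(tokens):
    """Mean (voiceless - voiced) onset perturbation per (source, band).

    Each token is a dict with hypothetical keys:
      source   : "natural" or "synthetic"
      band     : lexical-frequency stratum, e.g. "high" or "low"
      onset    : "voiceless" or "voiced" (onset consonant class)
      f0_onset : f0 in Hz at vowel onset
      f0_mid   : f0 in Hz at vowel midpoint
    """
    groups = defaultdict(list)
    for t in tokens:
        key = (t["source"], t["band"], t["onset"])
        groups[key].append(perturbation_st(t["f0_onset"], t["f0_mid"]))

    effects = {}
    for source, band, _ in list(groups):
        voiceless = groups.get((source, band, "voiceless"))
        voiced = groups.get((source, band, "voiced"))
        # Only report strata where both consonant classes were observed.
        if voiceless and voiced:
            effects[(source, band)] = mean(voiceless) - mean(voiced)
    return effects
```

Under these assumptions, a synthetic-speech effect that matches the natural one in the high-frequency band but shrinks toward zero in the low-frequency band would correspond to the memorization pattern the abstract reports.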
