Staff View: Measuring Prosody Diversity in Zero-Shot TTS: A New Metric, Benchmark, and Exploration :: Library Catalog

Bibliographic Details
Main Authors:	Yang, Yifan; Han, Bing; Wang, Hui; Zhou, Long; Wang, Wei; Cui, Mingyu; Tan, Xu; Chen, Xie
Format:	Preprint
Published:	2025
Subjects:	Audio and Speech Processing
Online Access:	https://arxiv.org/abs/2509.19928

_version_	1866918422303997952
author	Yang, Yifan; Han, Bing; Wang, Hui; Zhou, Long; Wang, Wei; Cui, Mingyu; Tan, Xu; Chen, Xie
author_facet	Yang, Yifan; Han, Bing; Wang, Hui; Zhou, Long; Wang, Wei; Cui, Mingyu; Tan, Xu; Chen, Xie
contents	Prosody diversity is essential for achieving naturalness and expressiveness in zero-shot text-to-speech (TTS). However, frequently used acoustic metrics capture only partial views of prosodic variation and correlate poorly with human perception, leaving the problem of reliably quantifying prosody diversity underexplored. To bridge this gap, we introduce ProsodyEval, a prosody diversity assessment dataset that provides Prosody Mean Opinion Score (PMOS) alongside conventional acoustic metrics. ProsodyEval comprises 1000 speech samples derived from 7 mainstream TTS systems, with 2000 human ratings. Building on this, we propose the Discretized Speech Weighted Edit Distance (DS-WED), a new objective diversity metric that quantifies prosodic variation via weighted edit distance over semantic tokens. Experiments on ProsodyEval show that DS-WED achieves substantially higher correlation with human judgments than existing acoustic metrics, while remaining highly robust in speech tokenization from HuBERT and WavLM. Leveraging DS-WED, we benchmark state-of-the-art open-source TTS systems on LibriSpeech test-clean and Seed-TTS test-en, and further explorations uncover several factors that influence prosody diversity, including generative modeling paradigms, duration control, and reinforcement learning. Moreover, we find that current large audio language models (LALMs) remain limited in capturing prosodic variations. Audio samples are available at https://prosodyeval.github.io.
format	Preprint
id	arxiv_https___arxiv_org_abs_2509_19928
institution	arXiv
publishDate	2025
record_format	arxiv
title	Measuring Prosody Diversity in Zero-Shot TTS: A New Metric, Benchmark, and Exploration
topic	Audio and Speech Processing
url	https://arxiv.org/abs/2509.19928
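
The contents field above describes DS-WED as a weighted edit distance over semantic tokens. This record does not state the weights or the tokenization details, so the following is only a minimal sketch of a generic weighted edit distance between two discrete token sequences, not the authors' implementation. The function name `weighted_edit_distance` and the cost parameters `ins_cost`, `del_cost`, and `sub_cost` are hypothetical, and the integer sequences stand in for token IDs that a speech tokenizer might produce.

```python
from typing import Sequence


def weighted_edit_distance(
    a: Sequence[int],
    b: Sequence[int],
    ins_cost: float = 1.0,   # assumed weight for inserting a token
    del_cost: float = 1.0,   # assumed weight for deleting a token
    sub_cost: float = 1.0,   # assumed weight for substituting a token
) -> float:
    """Minimum total cost of turning token sequence a into token sequence b."""
    # prev[j] holds the cost of turning the current prefix of a into b[:j].
    # For the empty prefix of a, that is j insertions.
    prev = [j * ins_cost for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        # Turning a[:i] into the empty sequence takes i deletions.
        curr = [i * del_cost] + [0.0] * len(b)
        for j in range(1, len(b) + 1):
            same = a[i - 1] == b[j - 1]
            curr[j] = min(
                prev[j] + del_cost,                        # delete a[i-1]
                curr[j - 1] + ins_cost,                    # insert b[j-1]
                prev[j - 1] + (0.0 if same else sub_cost), # match or substitute
            )
        prev = curr
    return prev[len(b)]


# Hypothetical token IDs for two renditions of the same text.
tokens_a = [12, 12, 7, 31, 31, 5]
tokens_b = [12, 7, 7, 31, 5, 5]
distance = weighted_edit_distance(tokens_a, tokens_b, sub_cost=1.5)
```

In this sketch a larger distance between two renditions of the same text would indicate that their token sequences differ more; how such distances are weighted and aggregated into a diversity score is specified in the paper, not in this record.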

Similar Items