Bibliographic Details
Main Authors: Yang, Yifan, Han, Bing, Wang, Hui, Zhou, Long, Wang, Wei, Cui, Mingyu, Tan, Xu, Chen, Xie
Format: Preprint
Published: 2025
Subjects: Audio and Speech Processing
Online Access: https://arxiv.org/abs/2509.19928
_version_ 1866918422303997952
author Yang, Yifan
Han, Bing
Wang, Hui
Zhou, Long
Wang, Wei
Cui, Mingyu
Tan, Xu
Chen, Xie
contents Prosody diversity is essential for achieving naturalness and expressiveness in zero-shot text-to-speech (TTS). However, commonly used acoustic metrics capture only partial views of prosodic variation and correlate poorly with human perception, leaving the problem of reliably quantifying prosody diversity underexplored. To bridge this gap, we introduce ProsodyEval, a prosody diversity assessment dataset that provides Prosody Mean Opinion Score (PMOS) alongside conventional acoustic metrics. ProsodyEval comprises 1000 speech samples derived from 7 mainstream TTS systems, with 2000 human ratings. Building on this, we propose the Discretized Speech Weighted Edit Distance (DS-WED), a new objective diversity metric that quantifies prosodic variation via weighted edit distance over semantic tokens. Experiments on ProsodyEval show that DS-WED achieves substantially higher correlation with human judgments than existing acoustic metrics, while remaining highly robust across speech tokenizations derived from HuBERT and WavLM. Leveraging DS-WED, we benchmark state-of-the-art open-source TTS systems on LibriSpeech test-clean and Seed-TTS test-en, and further explorations uncover several factors that influence prosody diversity, including generative modeling paradigms, duration control, and reinforcement learning. Moreover, we find that current large audio language models (LALMs) remain limited in capturing prosodic variations. Audio samples are available at https://prosodyeval.github.io.
format Preprint
id arxiv_https___arxiv_org_abs_2509_19928
institution arXiv
publishDate 2025
record_format arxiv
title Measuring Prosody Diversity in Zero-Shot TTS: A New Metric, Benchmark, and Exploration
topic Audio and Speech Processing
url https://arxiv.org/abs/2509.19928
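
The abstract in this record describes DS-WED as a weighted edit distance over discretized (semantic) speech tokens. The sketch below illustrates only that general idea and is not the authors' implementation: it assumes each utterance has already been converted to a sequence of integer token IDs (for example by a HuBERT- or WavLM-based tokenizer, which is not shown), and the function names, the per-operation weights, and the averaging over sample pairs are all hypothetical choices made for illustration.

```python
from itertools import combinations
from typing import Sequence


def weighted_edit_distance(
    a: Sequence[int],
    b: Sequence[int],
    w_ins: float = 1.0,  # assumed insertion cost (not taken from the paper)
    w_del: float = 1.0,  # assumed deletion cost (not taken from the paper)
    w_sub: float = 1.0,  # assumed substitution cost (not taken from the paper)
) -> float:
    """Levenshtein distance with separate costs per edit operation."""
    # prev[j] holds the cost of turning the first i-1 tokens of `a`
    # into the first j tokens of `b`; row 0 is pure insertions.
    prev = [j * w_ins for j in range(len(b) + 1)]
    for i, tok_a in enumerate(a, start=1):
        curr = [i * w_del]  # turning i tokens of `a` into an empty sequence
        for j, tok_b in enumerate(b, start=1):
            substitute = prev[j - 1] + (0.0 if tok_a == tok_b else w_sub)
            delete = prev[j] + w_del
            insert = curr[j - 1] + w_ins
            curr.append(min(substitute, delete, insert))
        prev = curr
    return prev[-1]


def mean_pairwise_distance(seqs: Sequence[Sequence[int]], **weights: float) -> float:
    """Average distance over all unordered pairs of token sequences.

    Hypothetical aggregation: several syntheses of the same text are
    compared pairwise, and a larger mean suggests more variation.
    """
    pairs = list(combinations(seqs, 2))
    if not pairs:
        return 0.0
    total = sum(weighted_edit_distance(x, y, **weights) for x, y in pairs)
    return total / len(pairs)
```

A call such as mean_pairwise_distance([tokens_1, tokens_2, tokens_3]) would then summarize how much several generations of the same sentence differ at the token level. The actual weighting scheme and tokenizer settings are defined in the preprint itself.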