Staff View: :: Library Catalog

Bibliographic Details
Main Authors:	Shen, Shengfan; Wu, Di; Song, Xingchen; Zhou, Dinghao; Xue, Liumeng; Meng, Meng; Luan, Jian; Wang, Shuai
Format:	Preprint
Published:	2026
Subjects:	Sound
Online Access:	https://arxiv.org/abs/2603.24430

_version_	1866915891034193920
author	Shen, Shengfan; Wu, Di; Song, Xingchen; Zhou, Dinghao; Xue, Liumeng; Meng, Meng; Luan, Jian; Wang, Shuai
author_facet	Shen, Shengfan; Wu, Di; Song, Xingchen; Zhou, Dinghao; Xue, Liumeng; Meng, Meng; Luan, Jian; Wang, Shuai
contents	Reliable evaluation of modern zero-shot text-to-speech (TTS) models remains challenging. Subjective tests are costly and hard to reproduce, while objective metrics often saturate, failing to distinguish SOTA systems. To address this, we propose Iterate to Differentiate (I2D), an evaluation framework that recursively synthesizes speech using the model's own outputs as references. Higher-quality models exhibit greater resilience to the distributional shift induced by iterative synthesis, resulting in slower performance degradation. I2D exploits this differential degradation to amplify performance gaps and reveal robustness. By aggregating objective metrics across iterations, I2D improves discriminability and alignment with human judgments, increasing system-level SRCC from 0.118 to 0.464 for UTMOSv2. Experiments on 11 models across Chinese, English, and emotion datasets demonstrate that I2D enables more reliable automated evaluation for zero-shot TTS.
format	Preprint
id	arxiv_https___arxiv_org_abs_2603_24430
institution	arXiv
publishDate	2026
record_format	arxiv
title	Iterate to Differentiate: Enhancing Discriminability and Reliability in Zero-Shot TTS Evaluation
topic	Sound
url	https://arxiv.org/abs/2603.24430
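
The iterative procedure described in the contents field can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `iterate_scores`, `aggregate`, `make_toy_tts`, and `identity_metric` are hypothetical, audio is stood in for by a single quality number, and averaging the per-iteration scores is only an assumed aggregation rule, since the record does not say how I2D combines them.

```python
from statistics import mean
from typing import Callable, List, TypeVar

Audio = TypeVar("Audio")


def iterate_scores(
    synthesize: Callable[[str, Audio], Audio],
    metric: Callable[[Audio], float],
    text: str,
    reference: Audio,
    num_iterations: int = 3,
) -> List[float]:
    """Score a model over several rounds, feeding each output back as the next reference."""
    scores: List[float] = []
    current_reference = reference
    for _ in range(num_iterations):
        output = synthesize(text, current_reference)
        scores.append(metric(output))
        current_reference = output  # the model's own output becomes the new reference
    return scores


def aggregate(scores: List[float]) -> float:
    """Assumed aggregation: plain mean over iterations (the record does not specify one)."""
    return mean(scores)


def make_toy_tts(retention: float) -> Callable[[str, float], float]:
    """Hypothetical stand-in for a TTS model: keeps a fixed fraction of the reference quality."""
    def synthesize(text: str, reference_quality: float) -> float:
        return reference_quality * retention
    return synthesize


def identity_metric(audio: float) -> float:
    """Hypothetical stand-in for an objective metric: the toy 'audio' is already a score."""
    return audio


if __name__ == "__main__":
    strong = iterate_scores(make_toy_tts(0.95), identity_metric, "hello", 1.0)
    weak = iterate_scores(make_toy_tts(0.80), identity_metric, "hello", 1.0)
    print("single-pass gap:", strong[0] - weak[0])
    print("last-iteration gap:", strong[-1] - weak[-1])
    print("aggregated scores:", aggregate(strong), aggregate(weak))
```

In this toy setting the model that degrades more slowly pulls further ahead with each round, which is the differential degradation the abstract refers to. A real evaluation would replace the two stand-ins with an actual zero-shot TTS system and an objective metric such as UTMOSv2.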