Bibliographic Details
Main Authors: Wang, Xi, Wang, Jie, Song, Xingchen, Song, Baijun, Xie, Jingran, Shao, Jiahe, Lin, Zijian, Wu, Di, Meng, Meng, Luan, Jian, Wu, Zhiyong
Format: Preprint
Published: 2026
Subjects:
Online Access: https://arxiv.org/abs/2604.22225
Abstract:
  • While generative text-to-speech (TTS) models approach human-level quality, monolithic metrics fail to diagnose fine-grained acoustic artifacts or explain perceptual collapse. To address this, we propose TTS-PRISM, a multi-dimensional diagnostic framework for Mandarin. First, we establish a 12-dimensional schema spanning stability to advanced expressiveness. Second, we design a targeted synthesis pipeline with adversarial perturbations and expert anchors to build a high-quality diagnostic dataset. Third, schema-driven instruction tuning embeds explicit scoring criteria and reasoning into an efficient end-to-end model. Experiments on a 1,600-sample Gold Test Set show TTS-PRISM outperforms generalist models in human alignment. Profiling six TTS paradigms establishes intuitive diagnostic flags that reveal fine-grained capability differences. TTS-PRISM is open-source, with code and checkpoints at https://github.com/xiaomi-research/tts-prism.