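| Main Authors: | Cui, Jiayan; Yang, Zhihan; Li, Naihan; Tian, Jiankun; Ma, Xingyu; Zhang, Yi; Chen, Guangyu; Yang, Runxuan; Cheng, Yuqing; Zhou, Yizhi; Yu, Guochen; Gu, Xiaotao; Tang, Jie |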
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | Sound |
| Online Access: | https://arxiv.org/abs/2512.14291 |
| _version_ | 1866918249912860672 |
|---|---|
| author | Cui, Jiayan; Yang, Zhihan; Li, Naihan; Tian, Jiankun; Ma, Xingyu; Zhang, Yi; Chen, Guangyu; Yang, Runxuan; Cheng, Yuqing; Zhou, Yizhi; Yu, Guochen; Gu, Xiaotao; Tang, Jie |
| author_facet | Cui, Jiayan; Yang, Zhihan; Li, Naihan; Tian, Jiankun; Ma, Xingyu; Zhang, Yi; Chen, Guangyu; Yang, Runxuan; Cheng, Yuqing; Zhou, Yizhi; Yu, Guochen; Gu, Xiaotao; Tang, Jie |
| contents | This work proposes GLM-TTS, a production-level TTS system designed for efficiency, controllability, and high-fidelity speech generation. GLM-TTS follows a two-stage architecture, consisting of a text-to-token autoregressive model and a token-to-waveform diffusion model. With only 100k hours of training data, GLM-TTS achieves state-of-the-art performance on multiple open-source benchmarks. To meet production requirements, GLM-TTS improves speech quality through an optimized speech tokenizer with fundamental frequency constraints and a GRPO-based multi-reward reinforcement learning framework that jointly optimizes pronunciation, speaker similarity, and expressive prosody. In parallel, the system enables efficient and controllable deployment via parameter-efficient LoRA-based voice customization and a hybrid phoneme-text input scheme that provides precise pronunciation control. Our code is available at https://github.com/zai-org/GLM-TTS. Real-time speech synthesis demos are provided via Z.ai (audio.z.ai) and the Zhipu Qingyan app/web (chatglm.cn). |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2512_14291 |
| institution | arXiv |
| publishDate | 2025 |
| record_format | arxiv |
| title | GLM-TTS Technical Report |
| topic | Sound |
| url | https://arxiv.org/abs/2512.14291 |