Bibliographic Details
Main Authors: Cui, Jiayan; Yang, Zhihan; Li, Naihan; Tian, Jiankun; Ma, Xingyu; Zhang, Yi; Chen, Guangyu; Yang, Runxuan; Cheng, Yuqing; Zhou, Yizhi; Yu, Guochen; Gu, Xiaotao; Tang, Jie
Format: Preprint
Published: 2025
Subjects: Sound
Online Access: https://arxiv.org/abs/2512.14291
_version_ 1866918249912860672
author Cui, Jiayan
Yang, Zhihan
Li, Naihan
Tian, Jiankun
Ma, Xingyu
Zhang, Yi
Chen, Guangyu
Yang, Runxuan
Cheng, Yuqing
Zhou, Yizhi
Yu, Guochen
Gu, Xiaotao
Tang, Jie
contents This work proposes GLM-TTS, a production-level TTS system designed for efficiency, controllability, and high-fidelity speech generation. GLM-TTS follows a two-stage architecture consisting of a text-to-token autoregressive model and a token-to-waveform diffusion model. With only 100k hours of training data, GLM-TTS achieves state-of-the-art performance on multiple open-source benchmarks. To meet production requirements, GLM-TTS improves speech quality through an optimized speech tokenizer with fundamental frequency constraints and a GRPO-based multi-reward reinforcement learning framework that jointly optimizes pronunciation, speaker similarity, and expressive prosody. In parallel, the system enables efficient and controllable deployment via parameter-efficient LoRA-based voice customization and a hybrid phoneme-text input scheme that provides precise pronunciation control. Our code is available at https://github.com/zai-org/GLM-TTS. Real-time speech synthesis demos are provided via Z.ai (audio.z.ai) and the Zhipu Qingyan app/web (chatglm.cn).
format Preprint
id arxiv_https___arxiv_org_abs_2512_14291
institution arXiv
publishDate 2025
record_format arxiv
title GLM-TTS Technical Report
topic Sound
url https://arxiv.org/abs/2512.14291