Staff View: GLM-TTS Technical Report :: Library Catalog

Bibliographic Details
Main Authors:	Cui, Jiayan; Yang, Zhihan; Li, Naihan; Tian, Jiankun; Ma, Xingyu; Zhang, Yi; Chen, Guangyu; Yang, Runxuan; Cheng, Yuqing; Zhou, Yizhi; Yu, Guochen; Gu, Xiaotao; Tang, Jie
Format:	Preprint
Published:	2025
Subjects:	Sound
Online Access:	https://arxiv.org/abs/2512.14291

_version_	1866918249912860672
author	Cui, Jiayan; Yang, Zhihan; Li, Naihan; Tian, Jiankun; Ma, Xingyu; Zhang, Yi; Chen, Guangyu; Yang, Runxuan; Cheng, Yuqing; Zhou, Yizhi; Yu, Guochen; Gu, Xiaotao; Tang, Jie
author_facet	Cui, Jiayan; Yang, Zhihan; Li, Naihan; Tian, Jiankun; Ma, Xingyu; Zhang, Yi; Chen, Guangyu; Yang, Runxuan; Cheng, Yuqing; Zhou, Yizhi; Yu, Guochen; Gu, Xiaotao; Tang, Jie
contents	This work proposes GLM-TTS, a production-level TTS system designed for efficiency, controllability, and high-fidelity speech generation. GLM-TTS follows a two-stage architecture, consisting of a text-to-token autoregressive model and a token-to-waveform diffusion model. With only 100k hours of training data, GLM-TTS achieves state-of-the-art performance on multiple open-source benchmarks. To meet production requirements, GLM-TTS improves speech quality through an optimized speech tokenizer with fundamental frequency constraints and a GRPO-based multi-reward reinforcement learning framework that jointly optimizes pronunciation, speaker similarity, and expressive prosody. In parallel, the system enables efficient and controllable deployment via parameter-efficient LoRA-based voice customization and a hybrid phoneme-text input scheme that provides precise pronunciation control. Our code is available at https://github.com/zai-org/GLM-TTS. Real-time speech synthesis demos are provided via Z.ai (audio.z.ai) and the Zhipu Qingyan app/web (chatglm.cn).
format	Preprint
id	arxiv_https___arxiv_org_abs_2512_14291
institution	arXiv
publishDate	2025
record_format	arxiv
title	GLM-TTS Technical Report
topic	Sound
url	https://arxiv.org/abs/2512.14291
