Staff View: Text-To-Speech with Chain-of-Details: modeling temporal dynamics in speech generation :: Library Catalog

Bibliographic Details
Main Authors:	Ma, Jianbo; Cartwright, Richard
Format:	Preprint
Published:	2026
Subjects:	Audio and Speech Processing
Online Access:	https://arxiv.org/abs/2604.19330

_version_	1866913072423108608
author	Ma, Jianbo; Cartwright, Richard
author_facet	Ma, Jianbo; Cartwright, Richard
contents	Recent advances in Text-To-Speech (TTS) synthesis have seen the popularity of multi-stage approaches that first predict semantic tokens and then generate acoustic tokens. In this paper, we extend the coarse-to-fine generation paradigm to the temporal domain and introduce Chain-of-Details (CoD), a novel framework that explicitly models temporal coarse-to-fine dynamics in speech generation using a cascaded architecture. Our method progressively refines temporal details across multiple stages, with each stage targeting a specific temporal granularity. All temporal detail predictions are performed using a shared decoder, enabling efficient parameter utilization across different temporal resolutions. Notably, we observe that the lowest detail level naturally performs phonetic planning without the need for an explicit phoneme duration predictor. We evaluate our method on several datasets and compare it against several baselines. Experimental results show that CoD achieves competitive performance with significantly fewer parameters than existing approaches. Our findings demonstrate that explicit modeling of temporal dynamics with the CoD framework leads to more natural speech synthesis.
format	Preprint
id	arxiv_https___arxiv_org_abs_2604_19330
institution	arXiv
publishDate	2026
record_format	arxiv
title	Text-To-Speech with Chain-of-Details: modeling temporal dynamics in speech generation
topic	Audio and Speech Processing
url	https://arxiv.org/abs/2604.19330

Similar Items