Vista Equipo: :: Library Catalog

Guardado en:

Detalles Bibliográficos
Autor principal:	Sakpiboonchit, Siratish
Formato:	Preprint
Publicado:	2025
Materias:	Audio and Speech Processing Sound
Acceso en línea:	https://arxiv.org/abs/2509.08696
Etiquetas:	Agregar Etiqueta Sin Etiquetas, Sea el primero en etiquetar este registro!

_version_	1866915488606453760
author	Sakpiboonchit, Siratish
author_facet	Sakpiboonchit, Siratish
contents	This paper presents a method to accelerate the inference process of diffusion transformer (DiT)-based text-to-speech (TTS) models by applying a selective caching mechanism to transformer layers. Specifically, I integrate SmoothCache into the F5-TTS architecture, focusing on caching outputs of self-attention and feed-forward network layers to reduce redundant computations during the denoising process. A calibration phase is introduced to analyze L1 relative errors between timesteps, guiding the selection of cache schedules that minimize quality degradation. To address the problem of inter-layer dependency, a unified caching schedule is adopted, applying the cache pattern derived from self-attention layers to both layer types. Experiments on LibriSpeech-PC and Seed-TTS datasets evaluate various cache thresholds and denoising step configurations. Results show that caching at higher denoising steps reduces inference time without compromising output quality, whereas caching at lower steps can negatively impact synthesis quality similarly to reducing the total number of denoising steps. Objective and subjective metrics confirm the effectiveness of SmoothCache in maintaining performance while improving computational efficiency. Comparisons between cached inference and reduced-step inference further highlight the benefits of selective caching, especially under high-step configurations. This work demonstrates that transformer layer caching is a practical solution for optimizing diffusion transformer-based TTS models without requiring architectural changes or retraining. Example inference results can be heard at https://siratish.github.io/F5-TTS_SmoothCache/ .
format	Preprint
id	arxiv_https___arxiv_org_abs_2509_08696
institution	arXiv
publishDate	2025
record_format	arxiv
spellingShingle	Accelerating Diffusion Transformer-Based Text-to-Speech with Transformer Layer Caching Sakpiboonchit, Siratish Audio and Speech Processing Sound This paper presents a method to accelerate the inference process of diffusion transformer (DiT)-based text-to-speech (TTS) models by applying a selective caching mechanism to transformer layers. Specifically, I integrate SmoothCache into the F5-TTS architecture, focusing on caching outputs of self-attention and feed-forward network layers to reduce redundant computations during the denoising process. A calibration phase is introduced to analyze L1 relative errors between timesteps, guiding the selection of cache schedules that minimize quality degradation. To address the problem of inter-layer dependency, a unified caching schedule is adopted, applying the cache pattern derived from self-attention layers to both layer types. Experiments on LibriSpeech-PC and Seed-TTS datasets evaluate various cache thresholds and denoising step configurations. Results show that caching at higher denoising steps reduces inference time without compromising output quality, whereas caching at lower steps can negatively impact synthesis quality similarly to reducing the total number of denoising steps. Objective and subjective metrics confirm the effectiveness of SmoothCache in maintaining performance while improving computational efficiency. Comparisons between cached inference and reduced-step inference further highlight the benefits of selective caching, especially under high-step configurations. This work demonstrates that transformer layer caching is a practical solution for optimizing diffusion transformer-based TTS models without requiring architectural changes or retraining. Example inference results can be heard at https://siratish.github.io/F5-TTS_SmoothCache/ .
title	Accelerating Diffusion Transformer-Based Text-to-Speech with Transformer Layer Caching
topic	Audio and Speech Processing Sound
url	https://arxiv.org/abs/2509.08696

Ejemplares similares