Bibliographic Details
Main Authors:	Wang, Wei; Cao, Rong; Guo, Yi; Chen, Zhengyang; Chen, Kuan; Huo, Yuanyuan
Format:	Preprint
Published:	2025
Subjects:	Sound
Online Access:	https://arxiv.org/abs/2510.07979

_version_	1866914082491203584
author	Wang, Wei; Cao, Rong; Guo, Yi; Chen, Zhengyang; Chen, Kuan; Huo, Yuanyuan
author_facet	Wang, Wei; Cao, Rong; Guo, Yi; Chen, Zhengyang; Chen, Kuan; Huo, Yuanyuan
contents	Flow-based generative models have greatly improved text-to-speech (TTS) synthesis quality, but inference speed remains limited by the iterative sampling process and multiple function evaluations (NFE). The recent MeanFlow model accelerates generation by modeling average velocity instead of instantaneous velocity. However, its direct application to TTS encounters challenges, including GPU memory overhead from Jacobian-vector products (JVP) and training instability due to self-bootstrap processes. To address these issues, we introduce IntMeanFlow, a framework for few-step speech generation with integral velocity distillation. By approximating average velocity with the teacher's instantaneous velocity over a temporal interval, IntMeanFlow eliminates the need for JVPs and self-bootstrap, improving stability and reducing GPU memory usage. We also propose the Optimal Step Sampling Search (O3S) algorithm, which identifies the model-specific optimal sampling steps, improving speech synthesis without additional inference overhead. Experiments show that IntMeanFlow achieves 1-NFE inference for token-to-spectrogram and 3-NFE for text-to-spectrogram tasks while maintaining high-quality synthesis. Demo samples are available at https://vvwangvv.github.io/intmeanflow.
format	Preprint
id	arxiv_https___arxiv_org_abs_2510_07979
institution	arXiv
publishDate	2025
record_format	arxiv
title	IntMeanFlow: Few-step Speech Generation with Integral Velocity Distillation
topic	Sound
url	https://arxiv.org/abs/2510.07979

Similar Items