MARC21: :: Library Catalog

Bibliographic Details
Main Authors:	Pandey, Isha; Gaikwad, Pranav; Parulekar, Amruta; Ramakrishnan, Ganesh
Format:	Preprint
Published:	2025
Subjects:	Audio and Speech Processing; Machine Learning
Online Access:	https://arxiv.org/abs/2507.16875

_version_	1866908461975994368
author	Pandey, Isha; Gaikwad, Pranav; Parulekar, Amruta; Ramakrishnan, Ganesh
author_facet	Pandey, Isha; Gaikwad, Pranav; Parulekar, Amruta; Ramakrishnan, Ganesh
contents	High-quality speech generation for low-resource languages, such as many Indian languages, remains a significant challenge due to limited data and diverse linguistic structures. Duration prediction is a critical component in many speech generation pipelines, playing a key role in modeling prosody and speech rhythm. While some recent generative approaches choose to omit explicit duration modeling, often at the cost of longer training times, we retain and explore this module to better understand its impact in the linguistically rich and data-scarce landscape of India. We train a non-autoregressive Continuous Normalizing Flow (CNF)-based speech model using publicly available Indian language data and evaluate multiple duration prediction strategies for zero-shot, speaker-specific generation. Our comparative analysis on speech-infilling tasks reveals nuanced trade-offs: infilling-based predictors improve intelligibility in some languages, while speaker-prompted predictors better preserve speaker characteristics in others. These findings inform the design and selection of duration strategies tailored to specific languages and tasks, underscoring the continued value of interpretable components like duration prediction in adapting advanced generative architectures to low-resource, multilingual settings.
format	Preprint
id	arxiv_https___arxiv_org_abs_2507_16875
institution	arXiv
publishDate	2025
record_format	arxiv
title	Technical report: Impact of Duration Prediction on Speaker-specific TTS for Indian Languages
topic	Audio and Speech Processing; Machine Learning
url	https://arxiv.org/abs/2507.16875

Similar Items