Bibliographic Details
Main Authors: Pandey, Isha, Gaikwad, Pranav, Parulekar, Amruta, Ramakrishnan, Ganesh
Format: Preprint
Published: 2025
Subjects: Audio and Speech Processing, Machine Learning
Online Access: https://arxiv.org/abs/2507.16875
_version_ 1866908461975994368
author Pandey, Isha
Gaikwad, Pranav
Parulekar, Amruta
Ramakrishnan, Ganesh
author_facet Pandey, Isha
Gaikwad, Pranav
Parulekar, Amruta
Ramakrishnan, Ganesh
contents High-quality speech generation for low-resource languages, such as many Indian languages, remains a significant challenge due to limited data and diverse linguistic structures. Duration prediction is a critical component in many speech generation pipelines, playing a key role in modeling prosody and speech rhythm. While some recent generative approaches omit explicit duration modeling, often at the cost of longer training times, we retain and explore this module to better understand its impact in the linguistically rich and data-scarce landscape of India. We train a non-autoregressive Continuous Normalizing Flow (CNF)-based speech model using publicly available Indian language data and evaluate multiple duration prediction strategies for zero-shot, speaker-specific generation. Our comparative analysis on speech-infilling tasks reveals nuanced trade-offs: infilling-based predictors improve intelligibility in some languages, while speaker-prompted predictors better preserve speaker characteristics in others. These findings inform the design and selection of duration strategies tailored to specific languages and tasks, underscoring the continued value of interpretable components like duration prediction in adapting advanced generative architectures to low-resource, multilingual settings.
format Preprint
id arxiv_https___arxiv_org_abs_2507_16875
institution arXiv
publishDate 2025
record_format arxiv
spellingShingle Technical report: Impact of Duration Prediction on Speaker-specific TTS for Indian Languages
Pandey, Isha
Gaikwad, Pranav
Parulekar, Amruta
Ramakrishnan, Ganesh
Audio and Speech Processing
Machine Learning
title Technical report: Impact of Duration Prediction on Speaker-specific TTS for Indian Languages
topic Audio and Speech Processing
Machine Learning
url https://arxiv.org/abs/2507.16875