Saved in:
| Main Authors: | , , |
|---|---|
| Format: | Preprint |
| Published: |
2024
|
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2410.06608 |
| Tags: |
Add Tag
No Tags, Be the first to tag this record!
|
| _version_ | 1866910641835474944 |
|---|---|
| author | Susladkar, Onkar Kishor Tripathi, Vishesh Ahmed, Biddwan |
| author_facet | Susladkar, Onkar Kishor Tripathi, Vishesh Ahmed, Biddwan |
| contents | This research introduces a comprehensive Bahasa text-to-speech (TTS) dataset and a novel TTS model, EnGen-TTS, designed to enhance the quality and versatility of synthetic speech in the Bahasa language. The dataset, spanning \textasciitilde55.0 hours and 52K audio recordings, integrates diverse textual sources, ensuring linguistic richness. A meticulous recording setup captures the nuances of Bahasa phonetics, employing professional equipment to ensure high-fidelity audio samples. Statistical analysis reveals the dataset's scale and diversity, laying the foundation for model training and evaluation. The proposed EnGen-TTS model performs better than established baselines, achieving a Mean Opinion Score (MOS) of 4.45 $\pm$ 0.13. Additionally, our investigation on real-time factor and model size highlights EnGen-TTS as a compelling choice, with efficient performance. This research marks a significant advancement in Bahasa TTS technology, with implications for diverse language applications. Link to Generated Samples: \url{https://bahasa-harmony-comp.vercel.app/} |
| format | Preprint |
| id |
arxiv_https___arxiv_org_abs_2410_06608 |
| institution | arXiv |
| publishDate | 2024 |
| record_format | arxiv |
| spellingShingle | Bahasa Harmony: A Comprehensive Dataset for Bahasa Text-to-Speech Synthesis with Discrete Codec Modeling of EnGen-TTS Susladkar, Onkar Kishor Tripathi, Vishesh Ahmed, Biddwan Sound Artificial Intelligence Audio and Speech Processing This research introduces a comprehensive Bahasa text-to-speech (TTS) dataset and a novel TTS model, EnGen-TTS, designed to enhance the quality and versatility of synthetic speech in the Bahasa language. The dataset, spanning \textasciitilde55.0 hours and 52K audio recordings, integrates diverse textual sources, ensuring linguistic richness. A meticulous recording setup captures the nuances of Bahasa phonetics, employing professional equipment to ensure high-fidelity audio samples. Statistical analysis reveals the dataset's scale and diversity, laying the foundation for model training and evaluation. The proposed EnGen-TTS model performs better than established baselines, achieving a Mean Opinion Score (MOS) of 4.45 $\pm$ 0.13. Additionally, our investigation on real-time factor and model size highlights EnGen-TTS as a compelling choice, with efficient performance. This research marks a significant advancement in Bahasa TTS technology, with implications for diverse language applications. Link to Generated Samples: \url{https://bahasa-harmony-comp.vercel.app/} |
| title | Bahasa Harmony: A Comprehensive Dataset for Bahasa Text-to-Speech Synthesis with Discrete Codec Modeling of EnGen-TTS |
| topic | Sound Artificial Intelligence Audio and Speech Processing |
| url | https://arxiv.org/abs/2410.06608 |