Staff View: LEMAS: A 150K-Hour Large-scale Extensible Multilingual Audio Suite with Generative Speech Models :: Library Catalog

Bibliographic Details
Main Authors:	Zhao, Zhiyuan; Lin, Lijian; Zhu, Ye; Xie, Kai; Liu, Yunfei; Li, Yu
Format:	Preprint
Published:	2026
Subjects:	Sound; Audio and Speech Processing
Online Access:	https://arxiv.org/abs/2601.04233

_version_	1866917189110464512
author	Zhao, Zhiyuan; Lin, Lijian; Zhu, Ye; Xie, Kai; Liu, Yunfei; Li, Yu
author_facet	Zhao, Zhiyuan; Lin, Lijian; Zhu, Ye; Xie, Kai; Liu, Yunfei; Li, Yu
contents	We present the LEMAS-Dataset, which, to our knowledge, is currently the largest open-source multilingual speech corpus with word-level timestamps. Covering over 150,000 hours across 10 major languages, LEMAS-Dataset is constructed via an efficient data processing pipeline that ensures high-quality data and annotations. To validate the effectiveness of LEMAS-Dataset across diverse generative paradigms, we train two benchmark models with distinct architectures and task specializations on this dataset. LEMAS-TTS, built upon a non-autoregressive flow-matching framework, leverages the dataset's massive scale and linguistic diversity to achieve robust zero-shot multilingual synthesis. Our proposed accent-adversarial training and CTC loss mitigate cross-lingual accent issues, enhancing synthesis stability. Complementarily, LEMAS-Edit employs an autoregressive decoder-only architecture that formulates speech editing as a masked token infilling task. By exploiting precise word-level alignments to construct training masks and adopting adaptive decoding strategies, it achieves seamless, smooth-boundary speech editing with natural transitions. Experimental results demonstrate that models trained on LEMAS-Dataset deliver high-quality synthesis and editing performance, confirming the dataset's quality. We envision that this richly timestamp-annotated, fine-grained multilingual corpus will drive future advances in prompt-based speech generation systems.
format	Preprint
id	arxiv_https___arxiv_org_abs_2601_04233
institution	arXiv
publishDate	2026
record_format	arxiv
title	LEMAS: A 150K-Hour Large-scale Extensible Multilingual Audio Suite with Generative Speech Models
topic	Sound; Audio and Speech Processing
url	https://arxiv.org/abs/2601.04233

Similar Items