Bibliographic Details
Main Authors: Shen, Yingli; Lai, Wen; Wang, Shuo; Zhang, Xueren; Luo, Kangyang; Fraser, Alexander; Sun, Maosong
Format: Preprint
Published: 2025
Subjects: Computation and Language
Online Access: https://arxiv.org/abs/2502.11546
_version_ 1866915572651917312
author Shen, Yingli
Lai, Wen
Wang, Shuo
Zhang, Xueren
Luo, Kangyang
Fraser, Alexander
Sun, Maosong
contents The rapid development of multilingual large language models (LLMs) highlights the need for high-quality, diverse, and well-curated multilingual datasets. In this paper, we introduce DCAD-2000 (Data Cleaning as Anomaly Detection), a large-scale multilingual corpus constructed from newly extracted Common Crawl data and existing multilingual sources. DCAD-2000 covers 2,282 languages, 46.72TB of text, and 8.63 billion documents, spanning 155 high- and medium-resource languages and 159 writing scripts. To overcome the limitations of existing data cleaning approaches, which rely on manually designed heuristic thresholds, we reframe data cleaning as an anomaly detection problem. This dynamic filtering paradigm substantially improves data quality by automatically identifying and removing noisy or anomalous content. By fine-tuning LLMs on DCAD-2000, we demonstrate notable improvements in data quality, robustness of the cleaning pipeline, and downstream performance, particularly for low-resource languages across multiple multilingual benchmarks.
format Preprint
id arxiv_https___arxiv_org_abs_2502_11546
institution arXiv
publishDate 2025
record_format arxiv
title DCAD-2000: A Multilingual Dataset across 2000+ Languages with Data Cleaning as Anomaly Detection
topic Computation and Language
url https://arxiv.org/abs/2502.11546