Vista Equipo: :: Library Catalog

Guardado en:

Detalles Bibliográficos
Autores principales:	Shen, Yingli, Lai, Wen, Wang, Shuo, Zhang, Xueren, Luo, Kangyang, Fraser, Alexander, Sun, Maosong
Formato:	Preprint
Publicado:	2025
Materias:	Computation and Language
Acceso en línea:	https://arxiv.org/abs/2502.11546
Etiquetas:	Agregar Etiqueta Sin Etiquetas, Sea el primero en etiquetar este registro!

_version_	1866915572651917312
author	Shen, Yingli Lai, Wen Wang, Shuo Zhang, Xueren Luo, Kangyang Fraser, Alexander Sun, Maosong
author_facet	Shen, Yingli Lai, Wen Wang, Shuo Zhang, Xueren Luo, Kangyang Fraser, Alexander Sun, Maosong
contents	The rapid development of multilingual large language models (LLMs) highlights the need for high-quality, diverse, and well-curated multilingual datasets. In this paper, we introduce DCAD-2000 (Data Cleaning as Anomaly Detection), a large-scale multilingual corpus constructed from newly extracted Common Crawl data and existing multilingual sources. DCAD-2000 covers 2,282 languages, 46.72TB of text, and 8.63 billion documents, spanning 155 high- and medium-resource languages and 159 writing scripts. To overcome the limitations of existing data cleaning approaches, which rely on manually designed heuristic thresholds, we reframe data cleaning as an anomaly detection problem. This dynamic filtering paradigm substantially improves data quality by automatically identifying and removing noisy or anomalous content. By fine-tuning LLMs on DCAD-2000, we demonstrate notable improvements in data quality, robustness of the cleaning pipeline, and downstream performance, particularly for low-resource languages across multiple multilingual benchmarks.
format	Preprint
id	arxiv_https___arxiv_org_abs_2502_11546
institution	arXiv
publishDate	2025
record_format	arxiv
spellingShingle	DCAD-2000: A Multilingual Dataset across 2000+ Languages with Data Cleaning as Anomaly Detection Shen, Yingli Lai, Wen Wang, Shuo Zhang, Xueren Luo, Kangyang Fraser, Alexander Sun, Maosong Computation and Language The rapid development of multilingual large language models (LLMs) highlights the need for high-quality, diverse, and well-curated multilingual datasets. In this paper, we introduce DCAD-2000 (Data Cleaning as Anomaly Detection), a large-scale multilingual corpus constructed from newly extracted Common Crawl data and existing multilingual sources. DCAD-2000 covers 2,282 languages, 46.72TB of text, and 8.63 billion documents, spanning 155 high- and medium-resource languages and 159 writing scripts. To overcome the limitations of existing data cleaning approaches, which rely on manually designed heuristic thresholds, we reframe data cleaning as an anomaly detection problem. This dynamic filtering paradigm substantially improves data quality by automatically identifying and removing noisy or anomalous content. By fine-tuning LLMs on DCAD-2000, we demonstrate notable improvements in data quality, robustness of the cleaning pipeline, and downstream performance, particularly for low-resource languages across multiple multilingual benchmarks.
title	DCAD-2000: A Multilingual Dataset across 2000+ Languages with Data Cleaning as Anomaly Detection
topic	Computation and Language
url	https://arxiv.org/abs/2502.11546

Ejemplares similares