Staff View: A New Massive Multilingual Dataset for High-Performance Language Technologies :: Library Catalog

Bibliographic Details
Main Authors:	de Gibert, Ona; Nail, Graeme; Arefyev, Nikolay; Bañón, Marta; van der Linde, Jelmer; Ji, Shaoxiong; Zaragoza-Bernabeu, Jaume; Aulamo, Mikko; Ramírez-Sánchez, Gema; Kutuzov, Andrey; Pyysalo, Sampo; Oepen, Stephan; Tiedemann, Jörg
Format:	Preprint
Published:	2024
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2403.14009

_version_	1866913276077539328
author	de Gibert, Ona; Nail, Graeme; Arefyev, Nikolay; Bañón, Marta; van der Linde, Jelmer; Ji, Shaoxiong; Zaragoza-Bernabeu, Jaume; Aulamo, Mikko; Ramírez-Sánchez, Gema; Kutuzov, Andrey; Pyysalo, Sampo; Oepen, Stephan; Tiedemann, Jörg
author_facet	de Gibert, Ona; Nail, Graeme; Arefyev, Nikolay; Bañón, Marta; van der Linde, Jelmer; Ji, Shaoxiong; Zaragoza-Bernabeu, Jaume; Aulamo, Mikko; Ramírez-Sánchez, Gema; Kutuzov, Andrey; Pyysalo, Sampo; Oepen, Stephan; Tiedemann, Jörg
contents	We present the HPLT (High Performance Language Technologies) language resources, a new massive multilingual dataset including both monolingual and bilingual corpora extracted from CommonCrawl and previously unused web crawls from the Internet Archive. We describe our methods for data acquisition, management and processing of large corpora, which rely on open-source software tools and high-performance computing. Our monolingual collection focuses on low- to medium-resourced languages and covers 75 languages and a total of ~5.6 trillion word tokens de-duplicated on the document level. Our English-centric parallel corpus is derived from its monolingual counterpart and covers 18 language pairs and more than 96 million aligned sentence pairs with roughly 1.4 billion English tokens. The HPLT language resources are one of the largest open text corpora ever released, providing a great resource for language modeling and machine translation training. We publicly release the corpora, the software, and the tools used in this work.
format	Preprint
id	arxiv_https___arxiv_org_abs_2403_14009
institution	arXiv
publishDate	2024
record_format	arxiv
title	A New Massive Multilingual Dataset for High-Performance Language Technologies
topic	Computation and Language
url	https://arxiv.org/abs/2403.14009

Similar Items