Staff View: The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text :: Library Catalog

Bibliographic Details
Main Authors:	Kandpal, Nikhil; Lester, Brian; Raffel, Colin; Majstorovic, Sebastian; Biderman, Stella; Abbasi, Baber; Soldaini, Luca; Shippole, Enrico; Cooper, A. Feder; Skowron, Aviya; Kirchenbauer, John; Longpre, Shayne; Sutawika, Lintang; Albalak, Alon; Xu, Zhenlin; Penedo, Guilherme; Allal, Loubna Ben; Bakouch, Elie; Pressman, John David; Fan, Honglu; Stander, Dashiell; Song, Guangyu; Gokaslan, Aaron; Goldstein, Tom; Bartoldson, Brian R.; Kailkhura, Bhavya; Murray, Tyler
Format:	Preprint
Published:	2025
Subjects:	Computation and Language; Machine Learning
Online Access:	https://arxiv.org/abs/2506.05209

_version_	1866915328466878464
author	Kandpal, Nikhil; Lester, Brian; Raffel, Colin; Majstorovic, Sebastian; Biderman, Stella; Abbasi, Baber; Soldaini, Luca; Shippole, Enrico; Cooper, A. Feder; Skowron, Aviya; Kirchenbauer, John; Longpre, Shayne; Sutawika, Lintang; Albalak, Alon; Xu, Zhenlin; Penedo, Guilherme; Allal, Loubna Ben; Bakouch, Elie; Pressman, John David; Fan, Honglu; Stander, Dashiell; Song, Guangyu; Gokaslan, Aaron; Goldstein, Tom; Bartoldson, Brian R.; Kailkhura, Bhavya; Murray, Tyler
author_facet	Kandpal, Nikhil; Lester, Brian; Raffel, Colin; Majstorovic, Sebastian; Biderman, Stella; Abbasi, Baber; Soldaini, Luca; Shippole, Enrico; Cooper, A. Feder; Skowron, Aviya; Kirchenbauer, John; Longpre, Shayne; Sutawika, Lintang; Albalak, Alon; Xu, Zhenlin; Penedo, Guilherme; Allal, Loubna Ben; Bakouch, Elie; Pressman, John David; Fan, Honglu; Stander, Dashiell; Song, Guangyu; Gokaslan, Aaron; Goldstein, Tom; Bartoldson, Brian R.; Kailkhura, Bhavya; Murray, Tyler
contents	Large language models (LLMs) are typically trained on enormous quantities of unlicensed text, a practice that has led to scrutiny due to possible intellectual property infringement and ethical concerns. Training LLMs on openly licensed text presents a first step towards addressing these issues, but prior data collection efforts have yielded datasets too small or low-quality to produce performant LLMs. To address this gap, we collect, curate, and release the Common Pile v0.1, an eight terabyte collection of openly licensed text designed for LLM pretraining. The Common Pile comprises content from 30 sources that span diverse domains including research papers, code, books, encyclopedias, educational materials, audio transcripts, and more. Crucially, we validate our efforts by training two 7 billion parameter LLMs on text from the Common Pile: Comma v0.1-1T and Comma v0.1-2T, trained on 1 and 2 trillion tokens respectively. Both models attain performance competitive with LLMs trained on unlicensed text with similar computational budgets, such as Llama 1 and 2 7B. In addition to releasing the Common Pile v0.1 itself, we also release the code used in its creation as well as the training mixture and checkpoints for the Comma v0.1 models.
format	Preprint
id	arxiv_https___arxiv_org_abs_2506_05209
institution	arXiv
publishDate	2025
record_format	arxiv
title	The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text
topic	Computation and Language; Machine Learning
url	https://arxiv.org/abs/2506.05209

Similar Items