author Kandpal, Nikhil
Lester, Brian
Raffel, Colin
Majstorovic, Sebastian
Biderman, Stella
Abbasi, Baber
Soldaini, Luca
Shippole, Enrico
Cooper, A. Feder
Skowron, Aviya
Kirchenbauer, John
Longpre, Shayne
Sutawika, Lintang
Albalak, Alon
Xu, Zhenlin
Penedo, Guilherme
Allal, Loubna Ben
Bakouch, Elie
Pressman, John David
Fan, Honglu
Stander, Dashiell
Song, Guangyu
Gokaslan, Aaron
Goldstein, Tom
Bartoldson, Brian R.
Kailkhura, Bhavya
Murray, Tyler
contents Large language models (LLMs) are typically trained on enormous quantities of unlicensed text, a practice that has drawn scrutiny over possible intellectual property infringement and ethical concerns. Training LLMs on openly licensed text is a first step towards addressing these issues, but prior data collection efforts have yielded datasets too small or too low-quality to produce performant LLMs. To address this gap, we collect, curate, and release the Common Pile v0.1, an eight-terabyte collection of openly licensed text designed for LLM pretraining. The Common Pile comprises content from 30 sources spanning diverse domains, including research papers, code, books, encyclopedias, educational materials, audio transcripts, and more. Crucially, we validate our efforts by training two 7-billion-parameter LLMs on text from the Common Pile: Comma v0.1-1T and Comma v0.1-2T, trained on 1 and 2 trillion tokens, respectively. Both models attain performance competitive with LLMs trained on unlicensed text with similar computational budgets, such as Llama 1 and 2 7B. In addition to releasing the Common Pile v0.1 itself, we also release the code used in its creation as well as the training mixture and checkpoints for the Comma v0.1 models.
format Preprint
id arxiv_https___arxiv_org_abs_2506_05209
institution arXiv
publishDate 2025
record_format arxiv
title The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text
topic Computation and Language
Machine Learning
url https://arxiv.org/abs/2506.05209