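| Main Authors: | Nikhil Kandpal, Brian Lester, Colin Raffel, Sebastian Majstorovic, Stella Biderman, Baber Abbasi, Luca Soldaini, Enrico Shippole, A. Feder Cooper, Aviya Skowron, John Kirchenbauer, Shayne Longpre, Lintang Sutawika, Alon Albalak, Zhenlin Xu, Guilherme Penedo, Loubna Ben Allal, Elie Bakouch, John David Pressman, Honglu Fan, Dashiell Stander, Guangyu Song, Aaron Gokaslan, Tom Goldstein, Brian R. Bartoldson, Bhavya Kailkhura, Tyler Murray |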
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | Computation and Language; Machine Learning |
| Online Access: | https://arxiv.org/abs/2506.05209 |
| author | Kandpal, Nikhil; Lester, Brian; Raffel, Colin; Majstorovic, Sebastian; Biderman, Stella; Abbasi, Baber; Soldaini, Luca; Shippole, Enrico; Cooper, A. Feder; Skowron, Aviya; Kirchenbauer, John; Longpre, Shayne; Sutawika, Lintang; Albalak, Alon; Xu, Zhenlin; Penedo, Guilherme; Allal, Loubna Ben; Bakouch, Elie; Pressman, John David; Fan, Honglu; Stander, Dashiell; Song, Guangyu; Gokaslan, Aaron; Goldstein, Tom; Bartoldson, Brian R.; Kailkhura, Bhavya; Murray, Tyler |
| contents | Large language models (LLMs) are typically trained on enormous quantities of unlicensed text, a practice that has led to scrutiny due to possible intellectual property infringement and ethical concerns. Training LLMs on openly licensed text presents a first step towards addressing these issues, but prior data collection efforts have yielded datasets too small or low-quality to produce performant LLMs. To address this gap, we collect, curate, and release the Common Pile v0.1, an eight-terabyte collection of openly licensed text designed for LLM pretraining. The Common Pile comprises content from 30 sources that span diverse domains including research papers, code, books, encyclopedias, educational materials, audio transcripts, and more. Crucially, we validate our efforts by training two 7-billion-parameter LLMs on text from the Common Pile: Comma v0.1-1T and Comma v0.1-2T, trained on 1 and 2 trillion tokens, respectively. Both models attain performance competitive with LLMs trained on unlicensed text with similar computational budgets, such as Llama 1 and 2 7B. In addition to releasing the Common Pile v0.1 itself, we also release the code used in its creation as well as the training mixture and checkpoints for the Comma v0.1 models. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2506_05209 |
| institution | arXiv |
| publishDate | 2025 |
| record_format | arxiv |
| title | The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text |
| topic | Computation and Language; Machine Learning |
| url | https://arxiv.org/abs/2506.05209 |