Bibliographic Details
Main Authors: Allal, Loubna Ben, Lozhkov, Anton, Bakouch, Elie, Blázquez, Gabriel Martín, Penedo, Guilherme, Tunstall, Lewis, Marafioti, Andrés, Kydlíček, Hynek, Lajarín, Agustín Piqueres, Srivastav, Vaibhav, Lochner, Joshua, Fahlgren, Caleb, Nguyen, Xuan-Son, Fourrier, Clémentine, Burtenshaw, Ben, Larcher, Hugo, Zhao, Haojun, Zakka, Cyril, Morlon, Mathieu, Raffel, Colin, von Werra, Leandro, Wolf, Thomas
Format: Preprint
Published: 2025
Subjects: Computation and Language
Online Access: https://arxiv.org/abs/2502.02737
author Allal, Loubna Ben
Lozhkov, Anton
Bakouch, Elie
Blázquez, Gabriel Martín
Penedo, Guilherme
Tunstall, Lewis
Marafioti, Andrés
Kydlíček, Hynek
Lajarín, Agustín Piqueres
Srivastav, Vaibhav
Lochner, Joshua
Fahlgren, Caleb
Nguyen, Xuan-Son
Fourrier, Clémentine
Burtenshaw, Ben
Larcher, Hugo
Zhao, Haojun
Zakka, Cyril
Morlon, Mathieu
Raffel, Colin
von Werra, Leandro
Wolf, Thomas
contents While large language models have facilitated breakthroughs in many applications of artificial intelligence, their inherent largeness makes them computationally expensive and challenging to deploy in resource-constrained settings. In this paper, we document the development of SmolLM2, a state-of-the-art "small" (1.7 billion parameter) language model (LM). To attain strong performance, we overtrain SmolLM2 on ~11 trillion tokens of data using a multi-stage training process that mixes web text with specialized math, code, and instruction-following data. We additionally introduce new specialized datasets (FineMath, Stack-Edu, and SmolTalk) at stages where we found existing datasets to be problematically small or low-quality. To inform our design decisions, we perform both small-scale ablations and a manual refinement process that updates the dataset mixing rates at each stage based on the performance at the previous stage. Ultimately, we demonstrate that SmolLM2 outperforms other recent small LMs including Qwen2.5-1.5B and Llama3.2-1B. To facilitate future research on LM development as well as applications of small LMs, we release both SmolLM2 and all of the datasets we prepared in the course of this project.
format Preprint
id arxiv_https___arxiv_org_abs_2502_02737
institution arXiv
publishDate 2025
record_format arxiv
title SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model
topic Computation and Language
url https://arxiv.org/abs/2502.02737