Staff View: :: Library Catalog

Saved in:

Bibliographic Details
Main Authors:	Lee, Dan, Han, Seungwook, Kumar, Akarsh, Agrawal, Pulkit
Format:	Preprint
Published:	2026
Subjects:	Machine Learning Artificial Intelligence Computation and Language
Online Access:	https://arxiv.org/abs/2603.10055
Tags:	Add Tag No Tags, Be the first to tag this record!

_version_	1866917331036274688
author	Lee, Dan Han, Seungwook Kumar, Akarsh Agrawal, Pulkit
author_facet	Lee, Dan Han, Seungwook Kumar, Akarsh Agrawal, Pulkit
contents	Pre-training is crucial for large language models (LLMs), as it is when most representations and capabilities are acquired. However, natural language pre-training has problems: high-quality text is finite, it contains human biases, and it entangles knowledge with reasoning. This raises a fundamental question: is natural language the only path to intelligence? We propose using neural cellular automata (NCA) to generate synthetic, non-linguistic data for pre-pre-training LLMs--training on synthetic-then-natural language. NCA data exhibits rich spatiotemporal structure and statistics resembling natural language while being controllable and cheap to generate at scale. We find that pre-pre-training on only 164M NCA tokens improves downstream language modeling by up to 6% and accelerates convergence by up to 1.6x. Surprisingly, this even outperforms pre-pre-training on 1.6B tokens of natural language from Common Crawl with more compute. These gains also transfer to reasoning benchmarks, including GSM8K, HumanEval, and BigBench-Lite. Investigating what drives transfer, we find that attention layers are the most transferable, and that optimal NCA complexity varies by domain: code benefits from simpler dynamics, while math and web text favor more complex ones. These results enable systematic tuning of the synthetic distribution to target domains. More broadly, our work opens a path toward more efficient models with fully synthetic pre-training.
format	Preprint
id	arxiv_https___arxiv_org_abs_2603_10055
institution	arXiv
publishDate	2026
record_format	arxiv
spellingShingle	Training Language Models via Neural Cellular Automata Lee, Dan Han, Seungwook Kumar, Akarsh Agrawal, Pulkit Machine Learning Artificial Intelligence Computation and Language Pre-training is crucial for large language models (LLMs), as it is when most representations and capabilities are acquired. However, natural language pre-training has problems: high-quality text is finite, it contains human biases, and it entangles knowledge with reasoning. This raises a fundamental question: is natural language the only path to intelligence? We propose using neural cellular automata (NCA) to generate synthetic, non-linguistic data for pre-pre-training LLMs--training on synthetic-then-natural language. NCA data exhibits rich spatiotemporal structure and statistics resembling natural language while being controllable and cheap to generate at scale. We find that pre-pre-training on only 164M NCA tokens improves downstream language modeling by up to 6% and accelerates convergence by up to 1.6x. Surprisingly, this even outperforms pre-pre-training on 1.6B tokens of natural language from Common Crawl with more compute. These gains also transfer to reasoning benchmarks, including GSM8K, HumanEval, and BigBench-Lite. Investigating what drives transfer, we find that attention layers are the most transferable, and that optimal NCA complexity varies by domain: code benefits from simpler dynamics, while math and web text favor more complex ones. These results enable systematic tuning of the synthetic distribution to target domains. More broadly, our work opens a path toward more efficient models with fully synthetic pre-training.
title	Training Language Models via Neural Cellular Automata
topic	Machine Learning Artificial Intelligence Computation and Language
url	https://arxiv.org/abs/2603.10055

Similar Items