Staff View: Why Larger Models Learn More: Effects of Capacity, Interference, and Rare-Task Retention

Bibliographic Details
Main Authors:	Huang, Jing; Wurgaft, Daniel; Bansal, Rachit; Ruis, Laura; Saphra, Naomi; Alvarez-Melis, David; Lampinen, Andrew Kyle; Potts, Christopher; Lubana, Ekdeep Singh
Format:	Preprint
Published:	2026
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2605.29548

_version_	1866918534947274752
author	Huang, Jing; Wurgaft, Daniel; Bansal, Rachit; Ruis, Laura; Saphra, Naomi; Alvarez-Melis, David; Lampinen, Andrew Kyle; Potts, Christopher; Lubana, Ekdeep Singh
author_facet	Huang, Jing; Wurgaft, Daniel; Bansal, Rachit; Ruis, Laura; Saphra, Naomi; Alvarez-Melis, David; Lampinen, Andrew Kyle; Potts, Christopher; Lubana, Ekdeep Singh
contents	Larger models learn tasks smaller models do not. What drives this phenomenon? We develop a simple phenomenological argument that power-law scaling already suggests that a larger model will be able to learn a part of the data distribution that a smaller model fails to learn, even with infinite training data. To validate this claim and identify its causes, we study the effects of model scaling on a synthetic setup consisting of a mixture of tasks that show monotonic scaling curves. The results point to a data-induced competition over resources (neurons). Specifically, smaller models allocate their neurons to high frequency or low complexity tasks, and so they learn solutions that perform poorly on rare and complex tasks. Moreover, this happens even when solutions capable of expressing the desired task exist. We then assess how a larger model circumvents this data-centric bottleneck, finding that it traces to a reduced interference mechanism: larger models can allocate enough resources to common tasks that the gradient updates for those tasks become weak, which means that they do not overwrite rare-task features as they slowly accumulate. Finally, to further validate these claims, we pretrain OLMo models (4M to 4B parameters) on novel tasks of varying frequency and complexity. The results mirror those from our synthetic data experiments: only the larger OLMo models learn the infrequent and complex tasks, and these larger models embed more task features in their representations and show less gradient interference between tasks. Overall, we offer a data-centric account of why larger models learn tasks that smaller models fail to. This helps explain why larger models are better in practice, and it can inform practical questions concerning model sizing and training data mixtures.
format	Preprint
id	arxiv_https___arxiv_org_abs_2605_29548
institution	arXiv
publishDate	2026
record_format	arxiv
title	Why Larger Models Learn More: Effects of Capacity, Interference, and Rare-Task Retention
topic	Machine Learning
url	https://arxiv.org/abs/2605.29548