Staff View: Olmo Hybrid: From Theory to Practice and Back :: Library Catalog

Bibliographic Details
Main Authors:	Merrill, William; Li, Yanhong; Romero, Tyler; Svete, Anej; Costello, Caia; Dasigi, Pradeep; Groeneveld, Dirk; Heineman, David; Kuehl, Bailey; Lambert, Nathan; Li, Chuan; Lo, Kyle; Malik, Saumya; Matusz, DJ; Minixhofer, Benjamin; Morrison, Jacob; Soldaini, Luca; Timbers, Finbarr; Walsh, Pete; Smith, Noah A.; Hajishirzi, Hannaneh; Sabharwal, Ashish
Format:	Preprint
Published:	2026
Subjects:	Machine Learning; Computation and Language
Online Access:	https://arxiv.org/abs/2604.03444

_version_	1866910137892995072
author	Merrill, William; Li, Yanhong; Romero, Tyler; Svete, Anej; Costello, Caia; Dasigi, Pradeep; Groeneveld, Dirk; Heineman, David; Kuehl, Bailey; Lambert, Nathan; Li, Chuan; Lo, Kyle; Malik, Saumya; Matusz, DJ; Minixhofer, Benjamin; Morrison, Jacob; Soldaini, Luca; Timbers, Finbarr; Walsh, Pete; Smith, Noah A.; Hajishirzi, Hannaneh; Sabharwal, Ashish
author_facet	Merrill, William; Li, Yanhong; Romero, Tyler; Svete, Anej; Costello, Caia; Dasigi, Pradeep; Groeneveld, Dirk; Heineman, David; Kuehl, Bailey; Lambert, Nathan; Li, Chuan; Lo, Kyle; Malik, Saumya; Matusz, DJ; Minixhofer, Benjamin; Morrison, Jacob; Soldaini, Luca; Timbers, Finbarr; Walsh, Pete; Smith, Noah A.; Hajishirzi, Hannaneh; Sabharwal, Ashish
contents	Recent work has demonstrated the potential of non-transformer language models, especially linear recurrent neural networks (RNNs) and hybrid models that mix recurrence and attention. Yet there is no consensus on whether the potential benefits of these new architectures justify the risk and effort of scaling them up. To address this, we provide evidence for the advantages of hybrid models over pure transformers on several fronts. First, theoretically, we show that hybrid models do not merely inherit the expressivity of transformers and linear RNNs, but can express tasks beyond both, such as code execution. Putting this theory to practice, we train Olmo Hybrid, a 7B-parameter model largely comparable to Olmo 3 7B but with the sliding window layers replaced by Gated DeltaNet layers. We show that Olmo Hybrid outperforms Olmo 3 across standard pretraining and mid-training evaluations, demonstrating the benefit of hybrid models in a controlled, large-scale setting. We find that the hybrid model scales significantly more efficiently than the transformer, explaining its higher performance. However, it is unclear why greater expressivity on specific formal problems should result in better scaling or superior performance on downstream tasks unrelated to those problems. To explain this apparent gap, we return to theory and argue why increased expressivity should translate to better scaling efficiency, completing the loop. Overall, our results suggest that hybrid models mixing attention and recurrent layers are a powerful extension to the language modeling paradigm: not merely to reduce memory during inference, but as a fundamental way to obtain more expressive models that scale better during pretraining.
format	Preprint
id	arxiv_https___arxiv_org_abs_2604_03444
institution	arXiv
publishDate	2026
record_format	arxiv
title	Olmo Hybrid: From Theory to Practice and Back
topic	Machine Learning; Computation and Language
url	https://arxiv.org/abs/2604.03444

Similar Items