Staff View: Provably Shorter Scratchpads in Hybrid DeltaNet-Attention Decoders :: Library Catalog


Bibliographic Details
Main Author:	Steifer, Tomasz
Format:	Preprint
Published:	2026
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2605.16640

_version_	1866916018036670464
author	Steifer, Tomasz
author_facet	Steifer, Tomasz
contents	We investigate the expressive power of hybrid recurrent-attention decoders, a class of architectures used in recent open-source language models such as Qwen3-Next and its successors. These models combine Gated Attention heads with recurrent Gated DeltaNet heads. Is there a formal advantage, in terms of model expressivity or efficiency, to such a hybrid architecture? We show that there is. We define a parity-conditioned retrieval task and show that, under a constant-precision assumption, a Qwen-style hybrid of Gated DeltaNet and Gated Attention solves this task with a constant scratchpad, or equivalently $O(1)$ chain-of-thought steps. In contrast, no similar solution exists for pure Gated DeltaNet models, while pure Gated Attention requires at least a polynomial scratchpad.
format	Preprint
id	arxiv_https___arxiv_org_abs_2605_16640
institution	arXiv
publishDate	2026
record_format	arxiv
title	Provably Shorter Scratchpads in Hybrid DeltaNet-Attention Decoders
topic	Machine Learning
url	https://arxiv.org/abs/2605.16640

Similar Items