MARC21: :: Library Catalog

Bibliographic Details
Main Authors:	Ternovtsii, Ivan; Bilak, Yurii
Format:	Preprint
Published:	2026
Subjects:	Machine Learning; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2605.09516

_version_	1866915999805079552
author	Ternovtsii, Ivan; Bilak, Yurii
author_facet	Ternovtsii, Ivan; Bilak, Yurii
contents	Standard Mixture-of-Experts (MoE) transformers route tokens to expert subnetworks within each layer, but the layer structure itself remains monolithic. We introduce Mixture of Layers (MoL), which replaces full-width transformer blocks (d_model) with K parallel thin blocks at reduced dimensionality (d_thin << d_model), connected via learned down/up projections and composed via top-k block routing. Scaling sparse block routing to many blocks creates an attention coverage problem, as each block sees fewer tokens. We address this by introducing hybrid attention, which pairs one shared softmax block for global context with Gated DeltaNet linear attention in routed blocks.
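
The routing scheme described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `ThinBlock`, `mol_layer`, and `top_k_route`, the tanh stand-in for the thin transformer block, the residual connection, and the softmax over only the selected logits are all assumptions made for illustration. Hybrid attention (the shared softmax block and Gated DeltaNet in routed blocks) is not modeled here.

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Small random matrix; initialization scheme is an arbitrary choice."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def top_k_route(logits, k):
    """Pick the k largest router logits and softmax over just those (assumed)."""
    idx = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in idx)
    exps = [math.exp(logits[i] - m) for i in idx]
    total = sum(exps)
    return idx, [e / total for e in exps]


class ThinBlock:
    """Hypothetical thin block: project d_model -> d_thin, transform, project back."""

    def __init__(self, d_model, d_thin, rng):
        self.down = rand_matrix(d_thin, d_model, rng)  # learned down projection
        self.up = rand_matrix(d_model, d_thin, rng)    # learned up projection

    def __call__(self, x):
        # tanh is a placeholder for the actual thin transformer block.
        h = [math.tanh(v) for v in matvec(self.down, x)]
        return matvec(self.up, h)


def mol_layer(x, router, blocks, k):
    """Route one token vector to its top-k thin blocks and mix their outputs."""
    idx, weights = top_k_route(matvec(router, x), k)
    out = list(x)  # residual connection (an assumption of this sketch)
    for i, w in zip(idx, weights):
        y = blocks[i](x)
        out = [o + w * v for o, v in zip(out, y)]
    return out, idx, weights


rng = random.Random(0)
d_model, d_thin, K, k = 16, 4, 8, 2  # d_thin << d_model, K parallel blocks, top-k routing
router = rand_matrix(K, d_model, rng)
blocks = [ThinBlock(d_model, d_thin, rng) for _ in range(K)]
x = [rng.gauss(0.0, 1.0) for _ in range(d_model)]
y, chosen, weights = mol_layer(x, router, blocks, k)
```

Only `k` of the `K` thin blocks run for a given token, so per-token compute scales with `k * d_thin` rather than with `d_model`. This is also why each block sees fewer tokens as `K` grows, which is the coverage problem the abstract attributes to sparse block routing.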
format	Preprint
id	arxiv_https___arxiv_org_abs_2605_09516
institution	arXiv
publishDate	2026
record_format	arxiv
title	Mixture of Layers with Hybrid Attention
topic	Machine Learning; Artificial Intelligence
url	https://arxiv.org/abs/2605.09516