Bibliographic Details
Main Authors: Ternovtsii, Ivan; Bilak, Yurii
Format: Preprint
Published: 2026
Subjects: Machine Learning, Artificial Intelligence
Online Access: https://arxiv.org/abs/2605.09516
_version_ 1866915999805079552
author Ternovtsii, Ivan
Bilak, Yurii
author_facet Ternovtsii, Ivan
Bilak, Yurii
contents Standard Mixture-of-Experts (MoE) transformers route tokens to expert subnetworks within each layer, but the layer structure itself remains monolithic. We introduce Mixture of Layers (MoL), which replaces full-width transformer blocks (d_model) with K parallel thin blocks at reduced dimensionality (d_thin << d_model), connected via learned down/up projections and composed via top-k block routing. Scaling sparse block routing to many blocks creates an attention coverage problem, as each block sees fewer tokens. We address this by introducing hybrid attention, which pairs one shared softmax block for global context with Gated DeltaNet linear attention in routed blocks.
format Preprint
id arxiv_https___arxiv_org_abs_2605_09516
institution arXiv
publishDate 2026
record_format arxiv
title Mixture of Layers with Hybrid Attention
topic Machine Learning
Artificial Intelligence
url https://arxiv.org/abs/2605.09516
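
The block routing described in the abstract (K parallel thin blocks at d_thin, learned down/up projections, top-k composition) can be illustrated with a minimal sketch. It is not the authors' implementation. The linear router, the softmax renormalisation over the selected blocks, the tanh layer standing in for each block body, the residual connection, and all class and parameter names are assumptions made for illustration. The hybrid attention component (shared softmax block plus Gated DeltaNet in routed blocks) is not modelled here.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Small random matrix; stands in for learned weights."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


class ThinBlock:
    """One thin block: down-project, transform at d_thin, up-project."""

    def __init__(self, d_model, d_thin, rng):
        self.down = rand_matrix(d_thin, d_model, rng)   # d_model -> d_thin
        self.inner = rand_matrix(d_thin, d_thin, rng)   # assumed stand-in for the block body
        self.up = rand_matrix(d_model, d_thin, rng)     # d_thin -> d_model

    def __call__(self, x):
        h = matvec(self.down, x)
        h = [math.tanh(v) for v in matvec(self.inner, h)]
        return matvec(self.up, h)


class MixtureOfLayers:
    """Hypothetical layer that routes one token vector to top_k of num_blocks thin blocks."""

    def __init__(self, d_model, d_thin, num_blocks, top_k, seed=0):
        rng = random.Random(seed)
        self.top_k = top_k
        self.router = rand_matrix(num_blocks, d_model, rng)  # assumed linear router
        self.blocks = [ThinBlock(d_model, d_thin, rng) for _ in range(num_blocks)]

    def route(self, x):
        """Return (block index, gate weight) pairs for the top-k scoring blocks."""
        scores = matvec(self.router, x)
        order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        chosen = order[: self.top_k]
        gates = softmax([scores[i] for i in chosen])  # renormalise over selected blocks (assumed)
        return list(zip(chosen, gates))

    def __call__(self, x):
        out = list(x)  # residual connection (assumed)
        for idx, gate in self.route(x):
            y = self.blocks[idx](x)
            out = [o + gate * v for o, v in zip(out, y)]
        return out


layer = MixtureOfLayers(d_model=16, d_thin=4, num_blocks=8, top_k=2)
token = [0.1 * i for i in range(16)]
routes = layer.route(token)
output = layer(token)
```

In this sketch only `top_k` of the `num_blocks` thin blocks run for a given token, and each operates at `d_thin` rather than `d_model`. Because every block processes only the tokens routed to it, each block sees fewer tokens as the number of blocks grows, which is the attention coverage problem the abstract raises.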