:: Library Catalog

Bibliographic Details
Main Authors:	Piękos, Piotr; Csordás, Róbert; Schmidhuber, Jürgen
Format:	Preprint
Published:	2025
Subjects:	Machine Learning; Computation and Language
Online Access:	https://arxiv.org/abs/2505.00315

Similar Items

SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention
by: Csordás, Róbert, et al.
Published: (2023)

The Sparse Frontier: Sparse Attention Trade-offs in Transformer LLMs
by: Nawrot, Piotr, et al.
Published: (2025)

Alleviating Forgetfulness of Linear Attention by Hybrid Sparse Attention and Contextualized Learnable Token Eviction
by: He, Mutian, et al.
Published: (2025)

MoEUT: Mixture-of-Experts Universal Transformers
by: Csordás, Róbert, et al.
Published: (2024)

Decoupling the "What" and "Where" With Polar Coordinate Positional Embeddings
by: Gopalakrishnan, Anand, et al.
Published: (2025)

Routing Absorption in Sparse Attention: Why Random Gates Are Hard to Beat
by: Aquino-Michaels, Keston
Published: (2026)

Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention
by: Yuan, Jingyang, et al.
Published: (2025)

Chain-of-Experts: Unlocking the Communication Power of Mixture-of-Experts Models
by: Wang, Zihan, et al.
Published: (2025)

Self-Organising Neural Discrete Representation Learning à la Kohonen
by: Irie, Kazuki, et al.
Published: (2023)

How Sparse Attention Approximates Exact Attention? Your Attention is Naturally $n^C$-Sparse
by: Deng, Yichuan, et al.
Published: (2024)

SEA: Sparse Linear Attention with Estimated Attention Mask
by: Lee, Heejun, et al.
Published: (2023)

CoSMoEs: Compact Sparse Mixture of Experts
by: Huber, Patrick, et al.
Published: (2025)

Diversifying the Expert Knowledge for Task-Agnostic Pruning in Sparse Mixture-of-Experts
by: Zhang, Zeliang, et al.
Published: (2024)

Routers Learn the Geometry of Their Experts: Geometric Coupling in Sparse Mixture-of-Experts
by: Ahrac, Sagi, et al.
Published: (2026)

Block Sparse Flash Attention
by: Ohayon, Daniel, et al.
Published: (2025)

SLA2: Sparse-Linear Attention with Learnable Routing and QAT
by: Zhang, Jintao, et al.
Published: (2026)

SEER-MoE: Sparse Expert Efficiency through Regularization for Mixture-of-Experts
by: Muzio, Alexandre, et al.
Published: (2024)

LoLA: Low-Rank Linear Attention With Sparse Caching
by: McDermott, Luke, et al.
Published: (2025)

PowerAttention: Exponentially Scaling of Receptive Fields for Effective Sparse Attention
by: Chen, Lida, et al.
Published: (2025)

ProxyAttn: Guided Sparse Attention via Representative Heads
by: Wang, Yixuan, et al.
Published: (2025)

DashAttention: Differentiable and Adaptive Sparse Hierarchical Attention
by: Huang, Yuxiang, et al.
Published: (2026)

Omni-Router: Sharing Routing Decisions in Sparse Mixture-of-Experts for Speech Recognition
by: Gu, Zijin, et al.
Published: (2025)

AdaSplash: Adaptive Sparse Flash Attention
by: Gonçalves, Nuno, et al.
Published: (2025)

Scaling Linear Attention with Sparse State Expansion
by: Pan, Yuqi, et al.
Published: (2025)

Measuring In-Context Computation Complexity via Hidden State Prediction
by: Herrmann, Vincent, et al.
Published: (2025)

NOSA: Native and Offloadable Sparse Attention
by: Huang, Yuxiang, et al.
Published: (2025)

SpecAttn: Speculating Sparse Attention
by: Shah, Harsh
Published: (2025)

Trainable Dynamic Mask Sparse Attention
by: Shi, Jingze, et al.
Published: (2025)

HSR-Enhanced Sparse Attention Acceleration
by: Chen, Bo, et al.
Published: (2024)

Sparse Attention across Multiple-context KV Cache
by: Cao, Ziyi, et al.
Published: (2025)

STS: Efficient Sparse Attention with Speculative Token Sparsity
by: Xu, Ceyu, et al.
Published: (2026)

AdaSplash-2: Faster Differentiable Sparse Attention
by: Gonçalves, Nuno, et al.
Published: (2026)

Metalearning Continual Learning Algorithms
by: Irie, Kazuki, et al.
Published: (2023)

DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices
by: Song, Chenyang, et al.
Published: (2026)

IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse
by: Bai, Yushi, et al.
Published: (2026)

Maximum Score Routing For Mixture-of-Experts
by: Dong, Bowen, et al.
Published: (2025)

Sparse Query Attention (SQA): A Computationally Efficient Attention Mechanism with Query Heads Reduction
by: Filipek, Adam
Published: (2025)

Sparse Attention Decomposition Applied to Circuit Tracing
by: Franco, Gabriel, et al.
Published: (2024)

Post-Training Sparse Attention with Double Sparsity
by: Yang, Shuo, et al.
Published: (2024)

BiSparse-AAS: Bilinear Sparse Attention and Adaptive Spans Framework for Scalable and Efficient Text Summarization
by: Hagos, Desta Haileselassie, et al.
Published: (2025)