| Main Authors: | Piękos, Piotr, Csordás, Róbert, Schmidhuber, Jürgen |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2505.00315 |
Similar Items
SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention
by: Csordás, Róbert, et al.
Published: (2023)
The Sparse Frontier: Sparse Attention Trade-offs in Transformer LLMs
by: Nawrot, Piotr, et al.
Published: (2025)
Alleviating Forgetfulness of Linear Attention by Hybrid Sparse Attention and Contextualized Learnable Token Eviction
by: He, Mutian, et al.
Published: (2025)
MoEUT: Mixture-of-Experts Universal Transformers
by: Csordás, Róbert, et al.
Published: (2024)
Decoupling the "What" and "Where" With Polar Coordinate Positional Embeddings
by: Gopalakrishnan, Anand, et al.
Published: (2025)
Routing Absorption in Sparse Attention: Why Random Gates Are Hard to Beat
by: Aquino-Michaels, Keston
Published: (2026)
Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention
by: Yuan, Jingyang, et al.
Published: (2025)
Chain-of-Experts: Unlocking the Communication Power of Mixture-of-Experts Models
by: Wang, Zihan, et al.
Published: (2025)
Self-Organising Neural Discrete Representation Learning à la Kohonen
by: Irie, Kazuki, et al.
Published: (2023)
How Sparse Attention Approximates Exact Attention? Your Attention is Naturally $n^C$-Sparse
by: Deng, Yichuan, et al.
Published: (2024)
SEA: Sparse Linear Attention with Estimated Attention Mask
by: Lee, Heejun, et al.
Published: (2023)
CoSMoEs: Compact Sparse Mixture of Experts
by: Huber, Patrick, et al.
Published: (2025)
Diversifying the Expert Knowledge for Task-Agnostic Pruning in Sparse Mixture-of-Experts
by: Zhang, Zeliang, et al.
Published: (2024)
Routers Learn the Geometry of Their Experts: Geometric Coupling in Sparse Mixture-of-Experts
by: Ahrac, Sagi, et al.
Published: (2026)
Block Sparse Flash Attention
by: Ohayon, Daniel, et al.
Published: (2025)
SLA2: Sparse-Linear Attention with Learnable Routing and QAT
by: Zhang, Jintao, et al.
Published: (2026)
SEER-MoE: Sparse Expert Efficiency through Regularization for Mixture-of-Experts
by: Muzio, Alexandre, et al.
Published: (2024)
LoLA: Low-Rank Linear Attention With Sparse Caching
by: McDermott, Luke, et al.
Published: (2025)
PowerAttention: Exponentially Scaling of Receptive Fields for Effective Sparse Attention
by: Chen, Lida, et al.
Published: (2025)
ProxyAttn: Guided Sparse Attention via Representative Heads
by: Wang, Yixuan, et al.
Published: (2025)
DashAttention: Differentiable and Adaptive Sparse Hierarchical Attention
by: Huang, Yuxiang, et al.
Published: (2026)
Omni-Router: Sharing Routing Decisions in Sparse Mixture-of-Experts for Speech Recognition
by: Gu, Zijin, et al.
Published: (2025)
AdaSplash: Adaptive Sparse Flash Attention
by: Gonçalves, Nuno, et al.
Published: (2025)
Scaling Linear Attention with Sparse State Expansion
by: Pan, Yuqi, et al.
Published: (2025)
Measuring In-Context Computation Complexity via Hidden State Prediction
by: Herrmann, Vincent, et al.
Published: (2025)
NOSA: Native and Offloadable Sparse Attention
by: Huang, Yuxiang, et al.
Published: (2025)
SpecAttn: Speculating Sparse Attention
by: Shah, Harsh
Published: (2025)
Trainable Dynamic Mask Sparse Attention
by: Shi, Jingze, et al.
Published: (2025)
HSR-Enhanced Sparse Attention Acceleration
by: Chen, Bo, et al.
Published: (2024)
Sparse Attention across Multiple-context KV Cache
by: Cao, Ziyi, et al.
Published: (2025)
STS: Efficient Sparse Attention with Speculative Token Sparsity
by: Xu, Ceyu, et al.
Published: (2026)
AdaSplash-2: Faster Differentiable Sparse Attention
by: Gonçalves, Nuno, et al.
Published: (2026)
Metalearning Continual Learning Algorithms
by: Irie, Kazuki, et al.
Published: (2023)
DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices
by: Song, Chenyang, et al.
Published: (2026)
IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse
by: Bai, Yushi, et al.
Published: (2026)
Maximum Score Routing For Mixture-of-Experts
by: Dong, Bowen, et al.
Published: (2025)
Sparse Query Attention (SQA): A Computationally Efficient Attention Mechanism with Query Heads Reduction
by: Filipek, Adam
Published: (2025)
Sparse Attention Decomposition Applied to Circuit Tracing
by: Franco, Gabriel, et al.
Published: (2024)
Post-Training Sparse Attention with Double Sparsity
by: Yang, Shuo, et al.
Published: (2024)
BiSparse-AAS: Bilinear Sparse Attention and Adaptive Spans Framework for Scalable and Efficient Text Summarization
by: Hagos, Desta Haileselassie, et al.
Published: (2025)