| Main Authors: | Guo, Han, Yang, Songlin, Goel, Tarushii, Xing, Eric P., Dao, Tri, Kim, Yoon |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2506.04761 |
Similar Items
Gated Linear Attention Transformers with Hardware-Efficient Training
by: Yang, Songlin, et al.
Published: (2023)
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
by: Gu, Albert, et al.
Published: (2023)
CODA: Rewriting Transformer Blocks as GEMM-Epilogue Programs
by: Guo, Han, et al.
Published: (2026)
Hardware-Efficient Attention for Fast Decoding
by: Zadouri, Ted, et al.
Published: (2025)
Parallelizing Linear Transformers with the Delta Rule over Sequence Length
by: Yang, Songlin, et al.
Published: (2024)
LQ-LoRA: Low-rank Plus Quantized Matrix Decomposition for Efficient Language Model Finetuning
by: Guo, Han, et al.
Published: (2023)
Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
by: Dao, Tri, et al.
Published: (2024)
FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision
by: Shah, Jay, et al.
Published: (2024)
Fast KV Compaction via Attention Matching
by: Zweiger, Adam, et al.
Published: (2026)
Adaptive Memory Decay for Log-Linear Attention
by: Amin, Yaxita, et al.
Published: (2026)
Speculative Speculative Decoding
by: Kumar, Tanishq, et al.
Published: (2026)
M$^2$RNN: Non-Linear RNNs with Matrix-Valued States for Scalable Language Modeling
by: Mishra, Mayank, et al.
Published: (2026)
PaTH Attention: Position Encoding via Accumulating Householder Transformations
by: Yang, Songlin, et al.
Published: (2025)
In-Context Learning in Linear vs. Quadratic Attention Models: An Empirical Study on Regression Tasks
by: Goel, Ayush, et al.
Published: (2026)
SonicMoE: Accelerating MoE with IO and Tile-aware Optimizations
by: Guo, Wentao, et al.
Published: (2025)
Linear Log-Normal Attention with Unbiased Concentration
by: Nahshan, Yury, et al.
Published: (2023)
Hydra: Bidirectional State Space Models Through Generalized Matrix Mixers
by: Hwang, Sukjun, et al.
Published: (2024)
Fast Matrix Multiplications for Lookup Table-Quantized LLMs
by: Guo, Han, et al.
Published: (2024)
Multiclass Calibration Assessment and Recalibration of Probability Predictions via the Linear Log Odds Calibration Function
by: Vennos, Amy, et al.
Published: (2026)
Cottention: Linear Transformers With Cosine Attention
by: Mongaras, Gabriel, et al.
Published: (2024)
ZeroS: Zero-Sum Linear Attention for Efficient Transformers
by: Lu, Jiecheng, et al.
Published: (2026)
Learning to Interpret Weight Differences in Language Models
by: Goel, Avichal, et al.
Published: (2025)
The Dark Side of Trust: Authority Citation-Driven Jailbreak Attacks on Large Language Models
by: Yang, Xikang, et al.
Published: (2024)
The Mamba in the Llama: Distilling and Accelerating Hybrid Models
by: Wang, Junxiong, et al.
Published: (2024)
Training Dynamics of Softmax Self-Attention: Fast Global Convergence via Preconditioning
by: Goel, Gautam, et al.
Published: (2026)
BitDelta: Your Fine-Tune May Only Be Worth One Bit
by: Liu, James, et al.
Published: (2024)
SEA: Sparse Linear Attention with Estimated Attention Mask
by: Lee, Heejun, et al.
Published: (2023)
Scaling Stick-Breaking Attention: An Efficient Implementation and In-depth Study
by: Tan, Shawn, et al.
Published: (2024)
Superiority of Multi-Head Attention in In-Context Linear Regression
by: Cui, Yingqian, et al.
Published: (2024)
Caduceus: Bi-Directional Equivariant Long-Range DNA Sequence Modeling
by: Schiff, Yair, et al.
Published: (2024)
RAM-Net: Expressive Linear Attention with Selectively Addressable Memory
by: Xiao, Kaicheng, et al.
Published: (2026)
M1: Towards Scalable Test-Time Compute with Mamba Reasoning Models
by: Wang, Junxiong, et al.
Published: (2025)
Attention Is All You Need for KV Cache in Diffusion LLMs
by: Nguyen-Tri, Quan, et al.
Published: (2025)
Inductive Spatio-Temporal Kriging with Physics-Guided Increment Training Strategy for Air Quality Inference
by: Yang, Songlin, et al.
Published: (2025)
Graph Attention-based Deep Reinforcement Learning for solving the Chinese Postman Problem with Load-dependent costs
by: Hy, Truong Son, et al.
Published: (2023)
Scaling Bidirectional Spans and Span Violations in Attention Mechanism
by: Kim, Jongwook, et al.
Published: (2025)
Conditional Generative Framework with Peak-Aware Attention for Robust Chemical Detection under Interferences
by: Yoon, Namkyung, et al.
Published: (2026)
RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale
by: Goldstein, Daniel, et al.
Published: (2025)
Ladder-residual: parallelism-aware architecture for accelerating large model inference with communication overlapping
by: Zhang, Muru, et al.
Published: (2025)
On the Duality between Gradient Transformations and Adapters
by: Torroba-Hennigen, Lucas, et al.
Published: (2025)