| Main Authors: | Boursier, Etienne, Boyer, Claire |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Online Access: | https://arxiv.org/abs/2512.11784 |
Similar Items
Early alignment in two-layer networks training is a two-edged sword
by: Boursier, Etienne, et al.
Published: (2024)
Simplicity bias and optimization threshold in two-layer ReLU networks
by: Boursier, Etienne, et al.
Published: (2024)
Penalising the biases in norm regularisation enforces sparsity
by: Boursier, Etienne, et al.
Published: (2023)
Beyond Linear Attention: Softmax Transformers Implement In-Context Reinforcement Learning
by: Xie, Zixuan, et al.
Published: (2026)
A survey on multi-player bandits
by: Boursier, Etienne, et al.
Published: (2022)
Why Softmax Attention Outperforms Linear Attention
by: Deng, Yichuan, et al.
Published: (2023)
Attention-based PCA
by: Maulen-Soto, Rodrigo, et al.
Published: (2026)
Attention-based clustering
by: Maulen-Soto, Rodrigo, et al.
Published: (2025)
Gradient flow dynamics of shallow ReLU networks for square loss and orthogonal inputs
by: Boursier, Etienne, et al.
Published: (2022)
First-order ANIL provably learns representations despite overparametrization
by: Yüksel, Oğuz Kaan, et al.
Published: (2023)
The Hedgehog & the Porcupine: Expressive Linear Attentions with Softmax Mimicry
by: Zhang, Michael, et al.
Published: (2024)
Degrees of Freedom for Linear Attention: Distilling Softmax Attention with Optimal Feature Efficiency
by: Nishikawa, Naoki, et al.
Published: (2025)
Local Linear Attention: An Optimal Interpolation of Linear and Softmax Attention For Test-Time Regression
by: Zuo, Yifei, et al.
Published: (2025)
On the Expressiveness of Softmax Attention: A Recurrent Neural Network Perspective
by: Mongaras, Gabriel, et al.
Published: (2025)
A Theoretical Framework for Grokking: Interpolation followed by Riemannian Norm Minimisation
by: Boursier, Etienne, et al.
Published: (2025)
Statistical Advantage of Softmax Attention: Insights from Single-Location Regression
by: Duranthon, O., et al.
Published: (2025)
Benignity of loss landscape with weight decay requires both large overparametrization and initialization
by: Boursier, Etienne, et al.
Published: (2025)
Softmax Linear Attention: Reclaiming Global Competition
by: Xu, Mingwei, et al.
Published: (2026)
MetaLA: Unified Optimal Linear Approximation to Softmax Attention Map
by: Chou, Yuhong, et al.
Published: (2024)
Universal Approximation with Softmax Attention
by: Hu, Jerry Yao-Chieh, et al.
Published: (2025)
Mildly Overparameterized ReLU Networks on Orthogonal Data: Incremental Learning and Implicit Bias
by: Town, James, et al.
Published: (2026)
In-Context Linear Regression Demystified: Training Dynamics and Mechanistic Interpretability of Multi-Head Softmax Attention
by: He, Jianliang, et al.
Published: (2025)
Scalable-Softmax Is Superior for Attention
by: Nakanishi, Ken M.
Published: (2025)
On the Invariants of Softmax Attention
by: Lee, Wonsuk
Published: (2026)
Approximate information maximization for bandit games
by: Barbier-Chebbah, Alex, et al.
Published: (2023)
Softmax-free Linear Transformers
by: Lu, Jiachen, et al.
Published: (2022)
Softpick: No Attention Sink, No Massive Activations with Rectified Softmax
by: Zuhri, Zayd M. K., et al.
Published: (2025)
Softmax Attention with Constant Cost per Token
by: Heinsen, Franz A.
Published: (2024)
Attention layers provably solve single-location regression
by: Marion, Pierre, et al.
Published: (2024)
Sigmoid Self-Attention has Lower Sample Complexity than Softmax Self-Attention: A Mixture-of-Experts Perspective
by: Yan, Fanqi, et al.
Published: (2025)
Customizing the Inductive Biases of Softmax Attention using Structured Matrices
by: Kuang, Yilun, et al.
Published: (2025)
Forgetting Transformer: Softmax Attention with a Forget Gate
by: Lin, Zhixuan, et al.
Published: (2025)
Online Decision-Focused Learning
by: Capitaine, Aymeric, et al.
Published: (2025)
Rethinking Attention: Polynomial Alternatives to Softmax in Transformers
by: Saratchandran, Hemanth, et al.
Published: (2024)
Optimal Transport-based Conformal Prediction
by: Thurin, Gauthier, et al.
Published: (2025)
Adaptive Sparse Softmax: An Effective and Efficient Softmax Variant
by: Lv, Qi, et al.
Published: (2025)
Beyond Softmax: A New Perspective on Gradient Bandits
by: Melo, Emerson, et al.
Published: (2025)
Rethinking the Global Convergence of Softmax Policy Gradient with Linear Function Approximation
by: Lin, Max Qiushi, et al.
Published: (2025)
The Key to State Reduction in Linear Attention: A Rank-based Perspective
by: Nazari, Philipp, et al.
Published: (2026)
Minimalist Softmax Attention Provably Learns Constrained Boolean Functions
by: Hu, Jerry Yao-Chieh, et al.
Published: (2025)