:: Library Catalog

Bibliographic Details
Main Authors:	Boursier, Etienne, Boyer, Claire
Format:	Preprint
Published:	2025
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2512.11784

Similar Items

Early alignment in two-layer networks training is a two-edged sword
by: Boursier, Etienne, et al.
Published: (2024)

Simplicity bias and optimization threshold in two-layer ReLU networks
by: Boursier, Etienne, et al.
Published: (2024)

Penalising the biases in norm regularisation enforces sparsity
by: Boursier, Etienne, et al.
Published: (2023)

Beyond Linear Attention: Softmax Transformers Implement In-Context Reinforcement Learning
by: Xie, Zixuan, et al.
Published: (2026)

A survey on multi-player bandits
by: Boursier, Etienne, et al.
Published: (2022)

Why Softmax Attention Outperforms Linear Attention
by: Deng, Yichuan, et al.
Published: (2023)

Attention-based PCA
by: Maulen-Soto, Rodrigo, et al.
Published: (2026)

Attention-based clustering
by: Maulen-Soto, Rodrigo, et al.
Published: (2025)

Gradient flow dynamics of shallow ReLU networks for square loss and orthogonal inputs
by: Boursier, Etienne, et al.
Published: (2022)

First-order ANIL provably learns representations despite overparametrization
by: Yüksel, Oğuz Kaan, et al.
Published: (2023)

The Hedgehog & the Porcupine: Expressive Linear Attentions with Softmax Mimicry
by: Zhang, Michael, et al.
Published: (2024)

Degrees of Freedom for Linear Attention: Distilling Softmax Attention with Optimal Feature Efficiency
by: Nishikawa, Naoki, et al.
Published: (2025)

Local Linear Attention: An Optimal Interpolation of Linear and Softmax Attention For Test-Time Regression
by: Zuo, Yifei, et al.
Published: (2025)

On the Expressiveness of Softmax Attention: A Recurrent Neural Network Perspective
by: Mongaras, Gabriel, et al.
Published: (2025)

A Theoretical Framework for Grokking: Interpolation followed by Riemannian Norm Minimisation
by: Boursier, Etienne, et al.
Published: (2025)

Statistical Advantage of Softmax Attention: Insights from Single-Location Regression
by: Duranthon, O., et al.
Published: (2025)

Benignity of loss landscape with weight decay requires both large overparametrization and initialization
by: Boursier, Etienne, et al.
Published: (2025)

Softmax Linear Attention: Reclaiming Global Competition
by: Xu, Mingwei, et al.
Published: (2026)

MetaLA: Unified Optimal Linear Approximation to Softmax Attention Map
by: Chou, Yuhong, et al.
Published: (2024)

Universal Approximation with Softmax Attention
by: Hu, Jerry Yao-Chieh, et al.
Published: (2025)

Mildly Overparameterized ReLU Networks on Orthogonal Data: Incremental Learning and Implicit Bias
by: Town, James, et al.
Published: (2026)

In-Context Linear Regression Demystified: Training Dynamics and Mechanistic Interpretability of Multi-Head Softmax Attention
by: He, Jianliang, et al.
Published: (2025)

Scalable-Softmax Is Superior for Attention
by: Nakanishi, Ken M.
Published: (2025)

On the Invariants of Softmax Attention
by: Lee, Wonsuk
Published: (2026)

Approximate information maximization for bandit games
by: Barbier-Chebbah, Alex, et al.
Published: (2023)

Softmax-free Linear Transformers
by: Lu, Jiachen, et al.
Published: (2022)

Softpick: No Attention Sink, No Massive Activations with Rectified Softmax
by: Zuhri, Zayd M. K., et al.
Published: (2025)

Softmax Attention with Constant Cost per Token
by: Heinsen, Franz A.
Published: (2024)

Attention layers provably solve single-location regression
by: Marion, Pierre, et al.
Published: (2024)

Sigmoid Self-Attention has Lower Sample Complexity than Softmax Self-Attention: A Mixture-of-Experts Perspective
by: Yan, Fanqi, et al.
Published: (2025)

Customizing the Inductive Biases of Softmax Attention using Structured Matrices
by: Kuang, Yilun, et al.
Published: (2025)

Forgetting Transformer: Softmax Attention with a Forget Gate
by: Lin, Zhixuan, et al.
Published: (2025)

Online Decision-Focused Learning
by: Capitaine, Aymeric, et al.
Published: (2025)

Rethinking Attention: Polynomial Alternatives to Softmax in Transformers
by: Saratchandran, Hemanth, et al.
Published: (2024)

Optimal Transport-based Conformal Prediction
by: Thurin, Gauthier, et al.
Published: (2025)

Adaptive Sparse Softmax: An Effective and Efficient Softmax Variant
by: Lv, Qi, et al.
Published: (2025)

Beyond Softmax: A New Perspective on Gradient Bandits
by: Melo, Emerson, et al.
Published: (2025)

Rethinking the Global Convergence of Softmax Policy Gradient with Linear Function Approximation
by: Lin, Max Qiushi, et al.
Published: (2025)

The Key to State Reduction in Linear Attention: A Rank-based Perspective
by: Nazari, Philipp, et al.
Published: (2026)

Minimalist Softmax Attention Provably Learns Constrained Boolean Functions
by: Hu, Jerry Yao-Chieh, et al.
Published: (2025)