Bibliographic Details
Main Authors:	Agarwal, Deepak; Mavani, Dhyey Dharmendrakumar; Gupta, Suyash; Sethuraman, Karthik; Dharamsi, Tejas
Format:	Preprint
Published:	2026
Subjects:	Machine Learning; Probability; Statistics Theory; I.2.7; G.3; G.4
Online Access:	https://arxiv.org/abs/2602.22271

Table of Contents:

Self-attention is usually described as a flexible, content-adaptive way to mix a token with information from its past. We reinterpret causal self-attention transformers, the backbone of modern foundation models, within a probabilistic framework, much as classical PCA is extended to probabilistic PCA. This reformulation reveals a key structural consequence of the underlying change of variables: a barrier constraint emerges on the parameters of self-attention. The resulting geometry exposes a degeneracy boundary where the attention-induced mapping becomes locally ill-conditioned, yielding a stability-margin interpretation analogous to the margin in support vector machines. This, in turn, naturally gives rise to the concept of support tokens. We further show that causal transformers define a consistent stochastic process over infinite token sequences, providing a rigorous probabilistic foundation for sequence modeling. Building on this view, we derive a Bayesian MAP training objective that requires only a minimal modification to standard LLM training: adding a smooth log-barrier penalty to the usual cross-entropy loss. Empirically, the resulting training objective improves robustness to input perturbations and sharpens the margin geometry of the learned representations without sacrificing out-of-sample accuracy.
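
As a rough illustration of the training objective described above, the following Python sketch adds a log-barrier term to a cross-entropy loss. The abstract does not state the form of the barrier constraint, so the `margins` argument (positive quantities standing in for distance from the degeneracy boundary), the `barrier_weight` parameter, and all function names are assumptions made for illustration, not the authors' formulation.

```python
import math


def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    peak = max(logits)  # subtract the max for numerical stability
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_z - logits[target]


def log_barrier(margins):
    """Smooth log-barrier -sum(log m); infinite once any margin is <= 0.

    `margins` is a hypothetical stand-in for the constrained quantities;
    the abstract does not specify them.
    """
    if any(m <= 0.0 for m in margins):
        return math.inf
    return -sum(math.log(m) for m in margins)


def penalized_loss(logits, target, margins, barrier_weight=0.01):
    """Cross-entropy plus a weighted log-barrier penalty (assumed form)."""
    return cross_entropy(logits, target) + barrier_weight * log_barrier(margins)
```

In this sketch, a margin equal to 1 contributes no penalty, and the penalty grows without bound as any margin approaches zero, which is what pushes the parameters away from the boundary.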
