Saved in:
| Main Authors: | , , , , |
|---|---|
| Format: | Preprint |
| Published: |
2026
|
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2602.22271 |
| Tags: |
Add Tag
No Tags, Be the first to tag this record!
|
Table of Contents:
- Self-attention is usually described as a flexible, content-adaptive way to mix a token with information from its past. We reinterpret causal self-attention transformers, the backbone of modern foundation models, within a probabilistic framework, much as classical PCA is extended to probabilistic PCA. This reformulation reveals a key structural consequence of the underlying change of variables: a barrier constraint emerges on the parameters of self-attention. The resulting geometry exposes a degeneracy boundary where the attention-induced mapping becomes locally ill-conditioned, yielding a stability-margin interpretation analogous to the margin in support vector machines. This, in turn, naturally gives rise to the concept of support tokens. We further show that causal transformers define a consistent stochastic process over infinite token sequences, providing a rigorous probabilistic foundation for sequence modeling. Building on this view, we derive a Bayesian MAP training objective that requires only a minimal modification to standard LLM training: adding a smooth log-barrier penalty to the usual cross-entropy loss. Empirically, the resulting training objective improves robustness to input perturbations and sharpens the margin geometry of the learned representations without sacrificing out-of-sample accuracy.