Bibliographic Details
Main Author: Mehta, Nihal
Format: Preprint
Published: 2025
Subjects:
Online Access: https://arxiv.org/abs/2511.13780
_version_ 1866909909403041792
author Mehta, Nihal
author_facet Mehta, Nihal
contents This paper presents a mathematical interpretation of self-attention by connecting it to distributional semantics principles. We show that self-attention emerges from projecting corpus-level co-occurrence statistics into sequence context. Starting from the co-occurrence matrix underlying GloVe embeddings, we demonstrate how the projection naturally captures contextual influence, with the query-key-value mechanism arising as the natural asymmetric extension for modeling directional relationships. Positional encodings and multi-head attention then follow as structured refinements of this same projection principle. Our analysis demonstrates that the Transformer architecture's particular algebraic form follows from these projection principles rather than being an arbitrary design choice.
format Preprint
id arxiv_https___arxiv_org_abs_2511_13780
institution arXiv
publishDate 2025
record_format arxiv
title Self-Attention as Distributional Projection: A Unified Interpretation of Transformer Architecture
topic Machine Learning
url https://arxiv.org/abs/2511.13780