Internformat: :: Library Catalog

Gespeichert in:

Bibliographische Detailangaben
1. Verfasser:	Mehta, Nihal
Format:	Preprint
Veröffentlicht:	2025
Schlagworte:	Machine Learning
Online-Zugang:	https://arxiv.org/abs/2511.13780
Tags:	Tag hinzufügen Keine Tags, Fügen Sie den ersten Tag hinzu!

_version_	1866909909403041792
author	Mehta, Nihal
author_facet	Mehta, Nihal
contents	This paper presents a mathematical interpretation of self-attention by connecting it to distributional semantics principles. We show that self-attention emerges from projecting corpus-level co-occurrence statistics into sequence context. Starting from the co-occurrence matrix underlying GloVe embeddings, we demonstrate how the projection naturally captures contextual influence, with the query-key-value mechanism arising as the natural asymmetric extension for modeling directional relationships. Positional encodings and multi-head attention then follow as structured refinements of this same projection principle. Our analysis demonstrates that the Transformer architecture's particular algebraic form follows from these projection principles rather than being an arbitrary design choice.
format	Preprint
id	arxiv_https___arxiv_org_abs_2511_13780
institution	arXiv
publishDate	2025
record_format	arxiv
spellingShingle	Self-Attention as Distributional Projection: A Unified Interpretation of Transformer Architecture Mehta, Nihal Machine Learning This paper presents a mathematical interpretation of self-attention by connecting it to distributional semantics principles. We show that self-attention emerges from projecting corpus-level co-occurrence statistics into sequence context. Starting from the co-occurrence matrix underlying GloVe embeddings, we demonstrate how the projection naturally captures contextual influence, with the query-key-value mechanism arising as the natural asymmetric extension for modeling directional relationships. Positional encodings and multi-head attention then follow as structured refinements of this same projection principle. Our analysis demonstrates that the Transformer architecture's particular algebraic form follows from these projection principles rather than being an arbitrary design choice.
title	Self-Attention as Distributional Projection: A Unified Interpretation of Transformer Architecture
topic	Machine Learning
url	https://arxiv.org/abs/2511.13780

Ähnliche Einträge