Bibliographic Details
Main Authors: Ruscio, Valeria, Nanni, Umberto, Silvestri, Fabrizio
Format: Preprint
Published: 2025
Subjects: Machine Learning, Artificial Intelligence, Computation and Language
Online Access: https://arxiv.org/abs/2508.02546
author Ruscio, Valeria
Nanni, Umberto
Silvestri, Fabrizio
contents Attention sink (AS) is a consistent pattern in transformer attention maps where certain tokens (often special tokens or positional anchors) disproportionately attract attention from other tokens. We show that in transformers, AS is not an architectural artifact but the manifestation of a fundamental geometric principle: the establishment of reference frames that anchor representational spaces. We analyze several architectures and identify three distinct reference frame types (centralized, distributed, and bidirectional) that correlate with the attention sink phenomenon. We show that they emerge during the earliest stages of training as optimal solutions to the problem of establishing stable coordinate systems in high-dimensional spaces. We show the influence of architecture components, particularly position encoding implementations, on the specific type of reference frame. This perspective transforms our understanding of transformer attention mechanisms and provides insights for both architecture design and the relationship with AS.
format Preprint
id arxiv_https___arxiv_org_abs_2508_02546
institution arXiv
publishDate 2025
record_format arxiv
title What are you sinking? A geometric approach on attention sink
topic Machine Learning
Artificial Intelligence
Computation and Language
url https://arxiv.org/abs/2508.02546