Bibliographic Details
Main Authors: Zhao, Yize; Thrampoulidis, Christos
Format: Preprint
Published: 2025
Subjects: Computation and Language
Online Access: https://arxiv.org/abs/2505.08348
author Zhao, Yize
Thrampoulidis, Christos
contents We investigate how next-token prediction (NTP) optimization leads language models to extract and organize semantic structure from text. Our analysis, based on a tractable mathematical model and controlled synthetic data, reveals that NTP implicitly guides models to factor a centered support matrix encoding context-to-next-token co-occurrence patterns via singular value decomposition (SVD). While models never explicitly construct this matrix, learned word and context embeddings converge to its SVD factors, with singular vectors encoding latent semantic concepts through their sign patterns. We demonstrate that concepts corresponding to larger singular values are learned earlier during training, yielding a natural semantic hierarchy where broad categories emerge before fine-grained ones. This insight motivates orthant-based clustering, a method that combines concept signs to identify interpretable semantic categories. We validate our findings on synthetic datasets and pretrained language models, recovering diverse semantic structures such as grammatical categories, named entity types, and topical distinctions (medical, entertainment). Our work bridges classical distributional semantics and neural collapse geometry, characterizing how gradient-based optimization implicitly determines both the matrix representation and factorization method that encode semantic structure.
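The abstract's pipeline (center a context-to-next-token support matrix, take its SVD, then group tokens by the signs of their leading singular-vector coordinates) can be illustrated with a small sketch. This is not the authors' code. The function name `orthant_clusters`, the toy matrix, and the choice to center each context row across the vocabulary are all assumptions made for illustration; the paper may define the centering and the clustering differently.

```python
import numpy as np


def orthant_clusters(support, k):
    """Group tokens by the sign pattern of their top-k right-singular-vector coordinates.

    support: (contexts x tokens) 0/1 matrix; entry is 1 if the token can follow the context.
    k: number of leading singular directions ("concepts") to combine.
    """
    support = np.asarray(support, dtype=float)
    # Assumption: "centered" means subtracting each context's mean over the vocabulary.
    centered = support - support.mean(axis=1, keepdims=True)
    # Singular values come back in descending order; rows of Vt index tokens.
    _, singular_values, Vt = np.linalg.svd(centered, full_matrices=False)
    signs = np.sign(Vt[:k]).T  # shape: (tokens, k)
    clusters = {}
    for token, pattern in enumerate(signs):
        key = tuple(int(x) for x in pattern)
        clusters.setdefault(key, []).append(token)
    return singular_values, clusters


# Hypothetical toy data: contexts 0-1 are followed by tokens 0-1, contexts 2-3 by tokens 2-3.
toy = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
singular_values, clusters = orthant_clusters(toy, k=1)
print(sorted(clusters.values()))
```

With `k=1` the single sign splits the toy vocabulary into two groups. Raising `k` combines more signs and yields finer orthants, which mirrors the coarse-to-fine hierarchy the abstract describes. The overall sign of a singular vector is arbitrary, so only the grouping is meaningful, not which group is labeled positive.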
format Preprint
id arxiv_https___arxiv_org_abs_2505_08348
institution arXiv
publishDate 2025
record_format arxiv
title Geometry of Semantics in Next-Token Prediction: How Optimization Implicitly Organizes Linguistic Representations
topic Computation and Language
url https://arxiv.org/abs/2505.08348