Staff View: Language Models Do Not Embed Numbers Continuously :: Library Catalog

Bibliographic Details
Main Authors:	Davies, Alex O.; Nzoyem, Roussel; Ajmeri, Nirav; Filho, Telmo M. Silva
Format:	Preprint
Published:	2025
Subjects:	Artificial Intelligence; Machine Learning; I.2
Online Access:	https://arxiv.org/abs/2510.08009

_version_	1866915541819588608
author	Davies, Alex O.; Nzoyem, Roussel; Ajmeri, Nirav; Filho, Telmo M. Silva
author_facet	Davies, Alex O.; Nzoyem, Roussel; Ajmeri, Nirav; Filho, Telmo M. Silva
contents	Recent research has extensively studied how large language models manipulate integers in specific arithmetic tasks and, on a more fundamental level, how they represent numeric values. These previous works have found that language model embeddings can be used to reconstruct the original values; however, they do not evaluate whether language models actually model continuous values as continuous. Using expected properties of the embedding space, including linear reconstruction and principal component analysis, we show that language models not only represent numeric spaces as non-continuous but also introduce significant noise. Using models from three major providers (OpenAI, Google Gemini and Voyage AI), we show that while reconstruction is possible with high fidelity ($R^2 \geq 0.95$), principal components explain only a minor share of variation within the embedding space. This indicates that many components within the embedding space are orthogonal to the simple numeric input space. Further, both linear reconstruction and explained variance degrade with increasing decimal precision, despite the ordinal nature of the input space being fundamentally unchanged. The findings of this work therefore have implications for the many areas where embedding models are used, in particular where high numerical precision, large magnitudes or mixed-sign values are common.
format	Preprint
id	arxiv_https___arxiv_org_abs_2510_08009
institution	arXiv
publishDate	2025
record_format	arxiv
title	Language Models Do Not Embed Numbers Continuously
topic	Artificial Intelligence; Machine Learning; I.2
url	https://arxiv.org/abs/2510.08009
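
The contents field names two diagnostics, linear reconstruction and principal component analysis. The sketch below illustrates how the two can disagree, using synthetic data only. It is not the authors' code and calls no provider API: `toy_embed` is a hypothetical stand-in that places each number along one direction and adds isotropic noise, and the dimension, noise level and value range are arbitrary assumptions.

```python
import numpy as np


def toy_embed(values, dim=64, noise=1.0, seed=0):
    """Hypothetical stand-in for an embedding model (assumption, not a real API):
    each value is placed along one unit direction, then Gaussian noise is added."""
    rng = np.random.default_rng(seed)
    direction = rng.normal(size=dim)
    direction /= np.linalg.norm(direction)
    return np.outer(values, direction) + noise * rng.normal(size=(len(values), dim))


def linear_r2(embeddings, values):
    """In-sample R^2 of an ordinary least-squares map from embeddings to values."""
    X = np.hstack([embeddings, np.ones((len(values), 1))])  # intercept column
    coef, *_ = np.linalg.lstsq(X, values, rcond=None)
    residual = values - X @ coef
    return 1.0 - residual.var() / values.var()


def explained_variance_ratio(embeddings, k=2):
    """Share of total variance captured by the first k principal components."""
    centered = embeddings - embeddings.mean(axis=0)
    singular = np.linalg.svd(centered, compute_uv=False)  # sorted descending
    variance = singular ** 2
    return variance[:k].sum() / variance.sum()


values = np.linspace(-5.0, 5.0, 500)  # mixed-sign inputs, arbitrary range
emb = toy_embed(values)
r2 = linear_r2(emb, values)
evr = explained_variance_ratio(emb, k=2)
print(f"in-sample R^2 = {r2:.3f}, top-2 PC share = {evr:.3f}")
```

In this toy setting a linear probe recovers the values well while the leading principal components account for only a small share of the variance, because most of the variance lies in directions unrelated to the numeric input. The R^2 here is in-sample; a held-out split would be the more careful choice for real embeddings.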

Similar Items