Bibliographic Details
Main Authors: Budzinskiy, Stanislav; Gloser, Marian; Yilmaz, Tolunay; Tham, Ying Hong; Lin, Yuanyi; Fang, Wenyi; Wu, Fan; Petersen, Philipp
Format: Preprint
Published: 2026
Subjects: Machine Learning; Numerical Analysis
Online Access: https://arxiv.org/abs/2601.21623
author Budzinskiy, Stanislav
Gloser, Marian
Yilmaz, Tolunay
Tham, Ying Hong
Lin, Yuanyi
Fang, Wenyi
Wu, Fan
Petersen, Philipp
contents Mixed-precision computation is a hallmark of the current stage of AI, driving progress in large language models towards efficient, locally deployable solutions. This article addresses the floating-point computation of compositionally rich functions, concentrating on transformer inference. Based on a rounding error analysis of a composition $f(g(\mathrm{x}))$, we provide an adaptive strategy that selects a small subset of the components of $g(\mathrm{x})$ to be computed more accurately, while all other computations can be carried out at lower accuracy. We then explain how this strategy can be applied to different compositions within a transformer and illustrate its overall effect on transformer inference. We study the effectiveness of this algorithm numerically on GPT-2 models and demonstrate that even very low recomputation rates yield accuracy improvements of up to two orders of magnitude.
format Preprint
id arxiv_https___arxiv_org_abs_2601_21623
institution arXiv
publishDate 2026
record_format arxiv
title LAMP: Look-Ahead Mixed-Precision Inference of Large Language Models
topic Machine Learning
Numerical Analysis
url https://arxiv.org/abs/2601.21623
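
The abstract describes computing a composition $f(g(\mathrm{x}))$ mostly at low accuracy while recomputing a small subset of the components of $g(\mathrm{x})$ more accurately. The sketch below is one possible illustration of that general idea, not the authors' algorithm. It assumes that $g$ is a matrix-vector product, that $f$ is a softmax, that low precision means IEEE binary16 emulated by rounding after every operation, and that the components to recompute are simply the k largest low-precision values. All function names and the selection rule are hypothetical; the record does not state the paper's actual criterion.

```python
import math
import struct


def to_fp16(v):
    """Round a Python float to the nearest IEEE binary16 value (assumes |v| <= 65504)."""
    return struct.unpack("e", struct.pack("e", v))[0]


def dot_low(row, x):
    """Dot product with rounding to binary16 after every multiply and add."""
    acc = 0.0
    for a, b in zip(row, x):
        prod = to_fp16(to_fp16(a) * to_fp16(b))
        acc = to_fp16(acc + prod)
    return acc


def dot_high(row, x):
    """Dot product in Python's native double precision."""
    return sum(a * b for a, b in zip(row, x))


def softmax(z):
    """Numerically stable softmax in double precision."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def mixed_precision_softmax(W, x, k):
    """Compute softmax(W x), recomputing only k components of W x accurately.

    Assumption (hypothetical rule, not taken from the paper): the k largest
    low-precision components are the ones worth recomputing.
    """
    # First pass: every component of g(x) = W x in emulated binary16.
    z = [dot_low(row, x) for row in W]
    # Look ahead at f = softmax: pick the k components assumed to matter most.
    selected = sorted(range(len(z)), key=lambda i: z[i], reverse=True)[:k]
    # Second pass: recompute only the selected components in double precision.
    for i in selected:
        z[i] = dot_high(W[i], x)
    return softmax(z)
```

Calling `mixed_precision_softmax(W, x, 0)` gives the all-low-precision result, and `k = len(W)` reproduces the double-precision reference, so comparing intermediate values of `k` against that reference gives a rough feel for the accuracy gained per recomputed component in this toy setting.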