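| Main Authors: | Budzinskiy, Stanislav; Gloser, Marian; Yilmaz, Tolunay; Tham, Ying Hong; Lin, Yuanyi; Fang, Wenyi; Wu, Fan; Petersen, Philipp |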
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | Machine Learning; Numerical Analysis |
| Online Access: | https://arxiv.org/abs/2601.21623 |
| _version_ | 1866918487460413440 |
|---|---|
| author | Budzinskiy, Stanislav; Gloser, Marian; Yilmaz, Tolunay; Tham, Ying Hong; Lin, Yuanyi; Fang, Wenyi; Wu, Fan; Petersen, Philipp |
| contents | Mixed-precision computations are a hallmark of the current stage of AI, driving the progress in large language models towards efficient, locally deployable solutions. This article addresses the floating-point computation of compositionally rich functions, concentrating on transformer inference. Based on the rounding error analysis of a composition $f(g(\mathrm{x}))$, we provide an adaptive strategy that selects a small subset of components of $g(\mathrm{x})$ to be computed more accurately while all other computations can be carried out with lower accuracy. We then explain how this strategy can be applied to different compositions within a transformer and illustrate its overall effect on transformer inference. We study the effectiveness of this algorithm numerically on GPT-2 models and demonstrate that even very low recomputation rates yield accuracy improvements of up to two orders of magnitude. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2601_21623 |
| institution | arXiv |
| publishDate | 2026 |
| record_format | arxiv |
| title | LAMP: Look-Ahead Mixed-Precision Inference of Large Language Models |
| topic | Machine Learning; Numerical Analysis |
| url | https://arxiv.org/abs/2601.21623 |
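
The adaptive strategy summarized in the contents field (compute all components of $g(\mathrm{x})$ cheaply, then recompute a small subset more accurately before applying $f$) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: $g$ is a matrix-vector product, $f$ is softmax, low precision is emulated by rounding to float16, the subset is chosen by largest low-precision softmax weight, and the names `matvec_low` and `lookahead_mixed` are hypothetical.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax evaluated in float64."""
    z = np.asarray(z, dtype=np.float64)
    e = np.exp(z - z.max())
    return e / e.sum()


def matvec_low(W, x):
    """Emulated low-precision g(x) = W x: inputs and result are float16."""
    z16 = W.astype(np.float16) @ x.astype(np.float16)
    return z16.astype(np.float64)


def lookahead_mixed(W, x, k):
    """Compute softmax(W x), recomputing k components of W x in float64.

    Selection rule (an illustrative assumption, not the paper's criterion):
    recompute the k components with the largest low-precision softmax weight.
    """
    z_low = matvec_low(W, x)              # cheap pass over all components
    p_low = softmax(z_low)                # "look ahead" through f
    n = p_low.size
    idx = np.argsort(p_low)[n - k:]       # indices of the k largest weights
    z = z_low.copy()
    # Recompute only the selected rows in higher precision.
    z[idx] = W[idx].astype(np.float64) @ x.astype(np.float64)
    return softmax(z), idx


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.standard_normal((50, 64))
    x = rng.standard_normal(64)
    exact = softmax(W @ x)
    for k in (0, 5, 50):
        p, _ = lookahead_mixed(W, x, k)
        print(k, np.abs(p - exact).max())
```

The recomputation rate in this sketch is `k / n`; comparing the printed errors for different `k` shows how the trade-off between cost and accuracy can be explored, though the behavior on a real transformer depends on the selection criterion actually used.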