Staff View: Understanding Efficiency: Quantization, Batching, and Serving Strategies in LLM Energy Use :: Library Catalog

Bibliographic Details
Main Authors:	Delavande, Julien; Pierrard, Regis; Luccioni, Sasha
Format:	Preprint
Published:	2026
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2601.22362

_version_	1866911408667492352
author	Delavande, Julien; Pierrard, Regis; Luccioni, Sasha
author_facet	Delavande, Julien; Pierrard, Regis; Luccioni, Sasha
contents	Large Language Models (LLMs) are increasingly deployed in production, contributing towards shifting the burden in terms of computational resources and energy demands from training to inference. While prior work has examined the energy cost of inference per prompt or per token, we highlight how system-level design choices - such as numerical precision, batching strategy, and request scheduling - can lead to orders-of-magnitude differences in energy consumption for the same model. We perform a detailed empirical study of LLM inference energy and latency on NVIDIA H100 GPUs, analyzing the impact of quantization, batch size, and serving configuration (e.g., with Hugging Face's Text Generation Inference server). Our results reveal that lower-precision formats only yield energy gains in compute-bound regimes; that batching improves energy efficiency, especially in memory-bound phases like decoding; and that structured request timing (arrival shaping) can reduce per-request energy by up to 100 times. We argue that sustainable LLM deployment depends not only on model internals, but also on the orchestration of the serving stack. Our findings motivate phase-aware energy profiling and system-level optimizations for greener AI services.
format	Preprint
id	arxiv_https___arxiv_org_abs_2601_22362
institution	arXiv
publishDate	2026
record_format	arxiv
title	Understanding Efficiency: Quantization, Batching, and Serving Strategies in LLM Energy Use
topic	Machine Learning
url	https://arxiv.org/abs/2601.22362

Similar Items