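Staff View: Offline Energy-Optimal LLM Serving: Workload-Based Energy Models for LLM Inference on Heterogeneous Systems :: Library Catalog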

Bibliographic Details
Main Authors:	Wilkins, Grant; Keshav, Srinivasan; Mortier, Richard
Format:	Preprint
Published:	2024
Subjects:	Distributed, Parallel, and Cluster Computing
Online Access:	https://arxiv.org/abs/2407.04014

_version_	1866909242742538240
author	Wilkins, Grant; Keshav, Srinivasan; Mortier, Richard
author_facet	Wilkins, Grant; Keshav, Srinivasan; Mortier, Richard
contents	The rapid adoption of large language models (LLMs) has led to significant advances in natural language processing and text generation. However, the energy consumed by LLM inference remains a major challenge for sustainable AI deployment. To address this problem, we model the workload-dependent energy consumption and runtime of LLM inference tasks on heterogeneous GPU-CPU systems. By conducting an extensive characterization study of several state-of-the-art LLMs and analyzing their energy and runtime behavior across different magnitudes of input prompts and output text, we develop accurate (R^2 > 0.96) energy and runtime models for each LLM. We employ these models to explore an offline, energy-optimal LLM workload scheduling framework. Through a case study, we demonstrate the advantages of energy- and accuracy-aware scheduling compared to existing best practices.
format	Preprint
id	arxiv_https___arxiv_org_abs_2407_04014
institution	arXiv
publishDate	2024
record_format	arxiv
title	Offline Energy-Optimal LLM Serving: Workload-Based Energy Models for LLM Inference on Heterogeneous Systems
topic	Distributed, Parallel, and Cluster Computing
url	https://arxiv.org/abs/2407.04014
