Staff View: Dissecting the Impact of Mobile DVFS Governors on LLM Inference Performance and Energy Efficiency :: Library Catalog

Bibliographic Details
Main Authors:	Zhang, Zongpu; Dash, Pranab; Hu, Y. Charlie; Xu, Qiang; Li, Jian; Guan, Haibing
Format:	Preprint
Published:	2025
Subjects:	Operating Systems; Computation and Language
Online Access:	https://arxiv.org/abs/2507.02135

_version_	1866913923678076928
author	Zhang, Zongpu; Dash, Pranab; Hu, Y. Charlie; Xu, Qiang; Li, Jian; Guan, Haibing
author_facet	Zhang, Zongpu; Dash, Pranab; Hu, Y. Charlie; Xu, Qiang; Li, Jian; Guan, Haibing
contents	Large Language Models (LLMs) are increasingly being integrated into various applications and services running on billions of mobile devices. However, deploying LLMs on resource-limited mobile devices faces a significant challenge due to their high demand for computation, memory, and ultimately energy. While current LLM frameworks for mobile use three power-hungry components - CPU, GPU, and memory - even when running primarily-GPU LLM models, the optimized DVFS governors for CPU, GPU, and memory featured in modern mobile devices operate independently and are oblivious of each other. Motivated by this observation, in this work, we first measure the energy efficiency of a SOTA LLM framework with various LLM models on mobile phones, which shows that the triplet of mobile governors results in up to 40.4% longer prefilling and decoding latency compared to optimal combinations of CPU, GPU, and memory frequencies with the same energy consumption for sampled prefill and decode lengths. Second, we conduct an in-depth measurement study to uncover how the intricate interplay (or lack thereof) among the mobile governors causes the above inefficiency in LLM inference. Finally, based on these insights, we design FUSE - a unified energy-aware governor for optimizing the energy efficiency of LLM inference on mobile devices. Our evaluation using a ShareGPT dataset shows that FUSE reduces the time-to-first-token and time-per-output-token latencies by 7.0%-16.9% and 25.4%-36.8% on average, respectively, with the same energy-per-token for various mobile LLM models.
format	Preprint
id	arxiv_https___arxiv_org_abs_2507_02135
institution	arXiv
publishDate	2025
record_format	arxiv
title	Dissecting the Impact of Mobile DVFS Governors on LLM Inference Performance and Energy Efficiency
topic	Operating Systems; Computation and Language
url	https://arxiv.org/abs/2507.02135

Similar Items