Bibliographic Details
Main Authors: Zhang, Zongpu; Dash, Pranab; Hu, Y. Charlie; Xu, Qiang; Li, Jian; Guan, Haibing
Format: Preprint
Published: 2025
Subjects: Operating Systems; Computation and Language
Online Access: https://arxiv.org/abs/2507.02135
_version_ 1866913923678076928
author Zhang, Zongpu
Dash, Pranab
Hu, Y. Charlie
Xu, Qiang
Li, Jian
Guan, Haibing
contents Large Language Models (LLMs) are increasingly being integrated into various applications and services running on billions of mobile devices. However, deploying LLMs on resource-limited mobile devices faces a significant challenge due to their high demand for computation, memory, and ultimately energy. Current LLM frameworks for mobile use three power-hungry components (CPU, GPU, and memory) even when running primarily-GPU LLM models, yet the optimized dynamic voltage and frequency scaling (DVFS) governors for CPU, GPU, and memory featured in modern mobile devices operate independently and are oblivious of each other. Motivated by this observation, we first measure the energy efficiency of a SOTA LLM framework with various LLM models on mobile phones. The measurements show that the triplet of mobile governors results in up to 40.4% longer prefilling and decoding latency than optimal combinations of CPU, GPU, and memory frequencies with the same energy consumption, for sampled prefill and decode lengths. Second, we conduct an in-depth measurement study to uncover how the intricate interplay (or lack thereof) among the mobile governors causes this inefficiency in LLM inference. Finally, based on these insights, we design FUSE, a unified energy-aware governor for optimizing the energy efficiency of LLM inference on mobile devices. Our evaluation using a ShareGPT dataset shows that FUSE reduces the time-to-first-token and time-per-output-token latencies by 7.0%-16.9% and 25.4%-36.8% on average, respectively, at the same energy-per-token for various mobile LLM models.
format Preprint
id arxiv_https___arxiv_org_abs_2507_02135
institution arXiv
publishDate 2025
record_format arxiv
title Dissecting the Impact of Mobile DVFS Governors on LLM Inference Performance and Energy Efficiency
topic Operating Systems
Computation and Language
url https://arxiv.org/abs/2507.02135