Bibliographic Details
Main Authors: Song, Qingyu; Liu, Rui; Lin, Wei; Liao, Peiyu; Zhao, Wenqian; Wang, Yiwen; Hu, Shoubo; Jiang, Yining; Long, Mochun; Zhen, Hui-Ling; Jiang, Ning; Yuan, Mingxuan; Xiang, Qiao; Xu, Hong
Format: Preprint
Published: 2025
Online Access: https://arxiv.org/abs/2505.15030
Abstract:
  • Deploying large language models (LLMs) on edge devices enhances privacy but faces performance hurdles due to limited resources. We introduce a systematic methodology for evaluating on-device LLMs that balances capability, efficiency, and resource constraints. Through an extensive analysis of models ranging from 0.5B to 14B parameters and seven post-training quantization (PTQ) methods on commodity hardware, we demonstrate that: 1) heavily quantized large models consistently outperform smaller, high-precision models, with a performance threshold at ~3.5 effective bits per weight (BPW); 2) resource utilization scales linearly with BPW, though power and memory footprints vary by quantization algorithm; and 3) as model size decreases, the primary constraint on throughput shifts from communication overhead to computational latency. We conclude by offering guidelines for optimizing LLMs in resource-constrained edge environments. Our codebase is available at https://anonymous.4open.science/r/LLMOnDevice/.
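The "effective bits per weight" figure and the linear scaling of memory with BPW can be illustrated with a little arithmetic. The sketch below assumes a generic group-wise quantization format in which each group of weights shares one fp16 scale (and optionally a zero-point), so the per-group metadata adds a fractional overhead on top of the nominal weight bit width. The helper names `effective_bpw` and `weight_memory_gb` are hypothetical and not taken from the authors' codebase; the estimate covers weight storage only and ignores the KV cache, activations, and runtime buffers.

```python
def effective_bpw(weight_bits: int, group_size: int,
                  scale_bits: int = 16, zero_bits: int = 0) -> float:
    """Bits per weight including per-group metadata overhead.

    Assumes (hypothetically) one scale, plus an optional zero-point,
    stored once for every `group_size` weights.
    """
    return weight_bits + (scale_bits + zero_bits) / group_size


def weight_memory_gb(num_params: float, bpw: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes); linear in BPW."""
    return num_params * bpw / 8 / 1e9


if __name__ == "__main__":
    # Assumed scheme: 4-bit weights with one fp16 scale per group of 32.
    bpw = effective_bpw(weight_bits=4, group_size=32)
    print(f"effective BPW: {bpw}")
    for params in (0.5e9, 7e9, 14e9):
        print(f"{params / 1e9:.1f}B params at {bpw} BPW: "
              f"{weight_memory_gb(params, bpw):.3f} GB of weights")
    # Weight footprint only: a 14B model at 3.5 BPW vs. a 7B model at 8 BPW.
    print(weight_memory_gb(14e9, 3.5), weight_memory_gb(7e9, 8.0))
```

Under these assumptions a 14B model at about 3.5 BPW needs roughly 6.1 GB for its weights, slightly less than a 7B model stored at 8 BPW, which is one way to see why a heavily quantized larger model can fit in the same memory budget as a smaller high-precision one. Real footprints differ by algorithm, as the abstract notes.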