Bibliographic Details
Main Authors: Kodavanti, Sravanth, Vajrala, Sowmya, Miriyala, Srinivas, Tiwari, Utsav, Kumar, Uttam, Mahawar, Utkarsh Kumar, Singh, Achal Pratap, D, Arya, Mutyala, Narendra, Rajendiran, Vikram Nelvoy, Allur, Sharan Kumar, Lee, Euntaik, Kim, Dohyoung, Lee, HyeonSu, Cho, Gyusung, Kim, JungBae
Format: Preprint
Published: 2026
Subjects:
Online Access: https://arxiv.org/abs/2604.18655
author Kodavanti, Sravanth
Vajrala, Sowmya
Miriyala, Srinivas
Tiwari, Utsav
Kumar, Uttam
Mahawar, Utkarsh Kumar
Singh, Achal Pratap
D, Arya
Mutyala, Narendra
Rajendiran, Vikram Nelvoy
Allur, Sharan Kumar
Lee, Euntaik
Kim, Dohyoung
Lee, HyeonSu
Cho, Gyusung
Kim, JungBae
contents Deploying large language models (LLMs) on smartphones poses significant engineering challenges due to stringent constraints on memory, latency, and runtime flexibility. In this work, we present a hardware-aware framework for efficient on-device inference of a LLaMA-based multilingual foundation model supporting multiple use cases on Samsung Galaxy S24 and S25 devices with Qualcomm SM8650 and SM8750 chipsets, respectively. Our approach integrates application-specific LoRAs as runtime inputs to a single frozen inference graph, enabling dynamic task switching without recompilation or memory overhead. We further introduce a multi-stream decoding mechanism that concurrently generates stylistic variations (such as formal, polite, or jovial responses) within a single forward pass, reducing latency by up to 6x. To accelerate token generation, we apply Dynamic Self-Speculative Decoding (DS2D), a tree-based strategy that predicts future tokens without requiring a draft model, yielding up to a 2.3x speedup in decode time. Combined with quantization to INT4 and architecture-level optimizations, our system achieves 4-6x overall improvements in memory and latency while maintaining accuracy across 9 languages and 8 tasks. These results demonstrate the practical feasibility of deploying multi-use-case LLMs on edge devices, advancing the commercial viability of generative AI on mobile platforms.
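The idea in the abstract of passing LoRAs as runtime inputs to a single frozen graph can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the adapter table, and the toy 2x2 weights below are all hypothetical, and the sketch assumes the standard LoRA formulation y = Wx + scale * B(Ax). It shows only that one fixed function can serve several tasks when the low-rank matrices arrive as arguments rather than being baked into the weights.

```python
# Hypothetical sketch (not the paper's code): LoRA matrices are ordinary
# call-time inputs to one fixed function, so switching tasks means passing
# different inputs rather than rebuilding or recompiling anything.

def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def frozen_linear(x, w, lora_a, lora_b, scale=1.0):
    """Compute y = W x + scale * B (A x) with W fixed and A, B supplied per call."""
    base = matvec(w, x)
    low_rank = matvec(lora_b, matvec(lora_a, x))
    return [b + scale * d for b, d in zip(base, low_rank)]

# Frozen base weight (toy 2x2 identity); never modified after "compilation".
W = [[1.0, 0.0], [0.0, 1.0]]

# Assumed per-task rank-1 adapters: A is 1x2, B is 2x1. Task names are made up.
ADAPTERS = {
    "summarize": ([[1.0, 1.0]], [[0.5], [0.0]]),
    "rewrite": ([[1.0, -1.0]], [[0.0], [2.0]]),
}

def run(task, x):
    """Select the adapter for a task and feed it to the same frozen function."""
    lora_a, lora_b = ADAPTERS[task]
    return frozen_linear(x, W, lora_a, lora_b)

if __name__ == "__main__":
    print(run("summarize", [1.0, 2.0]))
    print(run("rewrite", [1.0, 2.0]))
```

Calling `run` with a different task name changes only the adapter arguments; `frozen_linear` and `W` stay untouched, which is the property the abstract attributes to the frozen inference graph.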
format Preprint
id arxiv_https___arxiv_org_abs_2604_18655
institution arXiv
publishDate 2026
record_format arxiv
title Unlocking the Edge deployment and ondevice acceleration of multi-LoRA enabled one-for-all foundational LLM
topic Distributed, Parallel, and Cluster Computing
Artificial Intelligence
Computation and Language
url https://arxiv.org/abs/2604.18655