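Staff View: Unlocking the Edge deployment and ondevice acceleration of multi-LoRA enabled one-for-all foundational LLM :: Library Catalog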

Bibliographic Details
Main Authors:	Kodavanti, Sravanth; Vajrala, Sowmya; Miriyala, Srinivas; Tiwari, Utsav; Kumar, Uttam; Mahawar, Utkarsh Kumar; Singh, Achal Pratap; D, Arya; Mutyala, Narendra; Rajendiran, Vikram Nelvoy; Allur, Sharan Kumar; Lee, Euntaik; Kim, Dohyoung; Lee, HyeonSu; Cho, Gyusung; Kim, JungBae
Format:	Preprint
Published:	2026
Subjects:	Distributed, Parallel, and Cluster Computing; Artificial Intelligence; Computation and Language
Online Access:	https://arxiv.org/abs/2604.18655

_version_	1866910161654775808
author	Kodavanti, Sravanth; Vajrala, Sowmya; Miriyala, Srinivas; Tiwari, Utsav; Kumar, Uttam; Mahawar, Utkarsh Kumar; Singh, Achal Pratap; D, Arya; Mutyala, Narendra; Rajendiran, Vikram Nelvoy; Allur, Sharan Kumar; Lee, Euntaik; Kim, Dohyoung; Lee, HyeonSu; Cho, Gyusung; Kim, JungBae
author_facet	Kodavanti, Sravanth; Vajrala, Sowmya; Miriyala, Srinivas; Tiwari, Utsav; Kumar, Uttam; Mahawar, Utkarsh Kumar; Singh, Achal Pratap; D, Arya; Mutyala, Narendra; Rajendiran, Vikram Nelvoy; Allur, Sharan Kumar; Lee, Euntaik; Kim, Dohyoung; Lee, HyeonSu; Cho, Gyusung; Kim, JungBae
contents	Deploying large language models (LLMs) on smartphones poses significant engineering challenges due to stringent constraints on memory, latency, and runtime flexibility. In this work, we present a hardware-aware framework for efficient on-device inference of a LLaMA-based multilingual foundation model supporting multiple use cases on Samsung Galaxy S24 and S25 devices with SM8650 and SM8750 Qualcomm chipsets respectively. Our approach integrates application-specific LoRAs as runtime inputs to a single frozen inference graph, enabling dynamic task switching without recompilation or memory overhead. We further introduce a multi-stream decoding mechanism that concurrently generates stylistic variations - such as formal, polite, or jovial responses - within a single forward pass, reducing latency by up to 6x. To accelerate token generation, we apply Dynamic Self-Speculative Decoding (DS2D), a tree-based strategy that predicts future tokens without requiring a draft model, yielding up to 2.3x speedup in decode time. Combined with quantization to INT4 and architecture-level optimizations, our system achieves 4-6x overall improvements in memory and latency while maintaining accuracy across 9 languages and 8 tasks. These results demonstrate practical feasibility of deploying multi-use-case LLMs on edge devices, advancing the commercial viability of Generative AI in mobile platforms.
format	Preprint
id	arxiv_https___arxiv_org_abs_2604_18655
institution	arXiv
publishDate	2026
record_format	arxiv
title	Unlocking the Edge deployment and ondevice acceleration of multi-LoRA enabled one-for-all foundational LLM
topic	Distributed, Parallel, and Cluster Computing; Artificial Intelligence; Computation and Language
url	https://arxiv.org/abs/2604.18655
