Bibliographic Details
Main Authors: Levine, Reese; Sharma, Rithik; Jain, Nikhil; Ramesh, Abhijit; Chen, Zheyuan; Abbas, Neha; Contini, James; Sorensen, Tyler
Format: Preprint
Published: 2026
Subjects: Distributed, Parallel, and Cluster Computing; Artificial Intelligence; Machine Learning
Online Access: https://arxiv.org/abs/2605.20706
author Levine, Reese
Sharma, Rithik
Jain, Nikhil
Ramesh, Abhijit
Chen, Zheyuan
Abbas, Neha
Contini, James
Sorensen, Tyler
contents Running language models in the browser presents a unique opportunity to build efficient, private, and portable AI applications, but requires contending with constrained memory availability and heterogeneous hardware targets. To realize this opportunity, we present Llamas on the Web (LlamaWeb), a WebGPU backend for llama.cpp that enables memory-efficient and performance-portable LLM inference across a wide range of model weight formats in the browser. Our design significantly reduces memory overhead through static memory planning and efficient model loading, addresses cross-device variability through a tunable kernel library, and introduces templated GPU kernels that support performant implementations of numerous quantization formats, enabling broad model support and extensibility to new formats. We evaluate LlamaWeb on 16 devices from 8 vendors, collecting data from 10 language models and four model weight formats. We compare LlamaWeb against existing browser-based LLM frameworks and find that LlamaWeb requires 29-33% less memory across several combinations of device, browser, and operating system. We also evaluate LlamaWeb's performance against these frameworks and find that it increases decode throughput by 45-69% across four GPUs from separate vendors. In addition, we compare LlamaWeb's performance against other llama.cpp backends, where it is competitive with and even beats vendor-specific backend performance on some devices.
format Preprint
id arxiv_https___arxiv_org_abs_2605_20706
institution arXiv
publishDate 2026
record_format arxiv
title Llamas on the Web: Memory-Efficient, Performance-Portable, and Multi-Precision LLM Inference with WebGPU
topic Distributed, Parallel, and Cluster Computing
Artificial Intelligence
Machine Learning
url https://arxiv.org/abs/2605.20706