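Staff View: Llamas on the Web: Memory-Efficient, Performance-Portable, and Multi-Precision LLM Inference with WebGPU :: Library Catalog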

Bibliographic Details
Main Authors:	Levine, Reese; Sharma, Rithik; Jain, Nikhil; Ramesh, Abhijit; Chen, Zheyuan; Abbas, Neha; Contini, James; Sorensen, Tyler
Format:	Preprint
Published:	2026
Subjects:	Distributed, Parallel, and Cluster Computing; Artificial Intelligence; Machine Learning
Online Access:	https://arxiv.org/abs/2605.20706

_version_	1866910240554876928
author	Levine, Reese; Sharma, Rithik; Jain, Nikhil; Ramesh, Abhijit; Chen, Zheyuan; Abbas, Neha; Contini, James; Sorensen, Tyler
author_facet	Levine, Reese; Sharma, Rithik; Jain, Nikhil; Ramesh, Abhijit; Chen, Zheyuan; Abbas, Neha; Contini, James; Sorensen, Tyler
contents	Running language models in the browser presents a unique opportunity to build efficient, private, and portable AI applications, but requires contending with constrained memory availability and heterogeneous hardware targets. To realize this opportunity, we present Llamas on the Web (LlamaWeb), a WebGPU backend for llama.cpp that enables memory-efficient and performance-portable LLM inference across a wide range of model weight formats in the browser. Our design significantly reduces memory overhead through static memory planning and efficient model loading, addresses cross-device variability through a tunable kernel library, and introduces templated GPU kernels that support performant implementations of numerous quantization formats, enabling broad model support and extensibility to new formats. We evaluate LlamaWeb on 16 devices from 8 vendors, collecting data from 10 language models and four model weight formats. We compare LlamaWeb against existing browser-based LLM frameworks and find that LlamaWeb requires 29-33% less memory across several combinations of device, browser, and operating system. We also evaluate LlamaWeb's performance against these frameworks and find that it increases decode throughput by 45-69% across four GPUs from separate vendors. In addition, we compare LlamaWeb's performance against other llama.cpp backends, where it is competitive with and even beats vendor-specific backend performance on some devices.
format	Preprint
id	arxiv_https___arxiv_org_abs_2605_20706
institution	arXiv
publishDate	2026
record_format	arxiv
title	Llamas on the Web: Memory-Efficient, Performance-Portable, and Multi-Precision LLM Inference with WebGPU
topic	Distributed, Parallel, and Cluster Computing; Artificial Intelligence; Machine Learning
url	https://arxiv.org/abs/2605.20706

Similar Items