Staff View: WANSpec: Leveraging Global Compute Capacity for LLM Inference :: Library Catalog

Bibliographic Details
Main Authors:	Martin, Noah; Dogar, Fahad
Format:	Preprint
Published:	2026
Subjects:	Distributed, Parallel, and Cluster Computing
Online Access:	https://arxiv.org/abs/2602.18931

_version_	1866911461348999168
author	Martin, Noah; Dogar, Fahad
author_facet	Martin, Noah; Dogar, Fahad
contents	Data centers capable of running large language models (LLMs) are spread across the globe. Some have high-end GPUs for running the most advanced models (100B+ parameters), and others are only suitable for smaller models (1B parameters). The most capable GPUs are in high demand thanks to the rapidly expanding applications of LLMs. Because of this demand, the choice of location for an LLM inference workload can affect request latency. In this work, we explore options to shift some aspects of inference to under-utilized data centers. We first observe the varying delays affecting inference in AWS services across regions, demonstrating that load is not spread evenly. We then introduce WANSpec, which offloads part of LLM generation to under-utilized data centers. In doing so, WANSpec can mitigate capacity issues and make effective use of on-site compute (i.e., at universities) to augment cloud providers. It does this with speculative decoding, a widely used technique for speeding up autoregressive decoding, by moving the draft model to the under-utilized compute resources. Our experiments in simulation and cloud deployments show that WANSpec can judiciously employ redundancy to avoid increases in latency while still reducing the forward passes of speculative decoding's draft model in high-demand data centers by over 50%.
format	Preprint
id	arxiv_https___arxiv_org_abs_2602_18931
institution	arXiv
publishDate	2026
record_format	arxiv
title	WANSpec: Leveraging Global Compute Capacity for LLM Inference
topic	Distributed, Parallel, and Cluster Computing
url	https://arxiv.org/abs/2602.18931

Similar Items