

Bibliographic Details
Main authors:	Zhu, Wenbin; Shen, Zhaoyan; Shao, Zili; Dai, Hongjun; Chen, Feng
Format:	Preprint
Published:	2025
Subjects:	Distributed, Parallel, and Cluster Computing; Artificial Intelligence; Hardware Architecture
Online access:	https://arxiv.org/abs/2512.01357

_version_	1866911296281116672
author	Zhu, Wenbin; Shen, Zhaoyan; Shao, Zili; Dai, Hongjun; Chen, Feng
author_facet	Zhu, Wenbin; Shen, Zhaoyan; Shao, Zili; Dai, Hongjun; Chen, Feng
contents	Serverless Large Language Models (LLMs) have emerged as a cost-effective solution for deploying AI services by enabling a 'pay-as-you-go' pricing model through GPU resource sharing. However, cold-start latency, especially the model loading phase, has become a critical performance bottleneck, as it scales linearly with model size and severely limits the practical deployment of large-scale LLM services. This paper presents Tangram, a novel system that accelerates Serverless LLM loading through efficient GPU memory reuse. By leveraging unused GPU memory to retain model parameters, Tangram significantly reduces model transfer time and cold-start latency. Its design includes three key components: a unified GPU memory pool for tensor-level parameter sharing across models, on-demand KV cache allocation for dynamic memory management, and GPU-affinity-aware scheduling for maximizing resource utilization. These techniques collectively address the critical challenges of inefficient memory usage and the cold-start problem in Serverless LLM platforms. We have implemented a fully functional prototype, and experiments show that Tangram achieves up to 6.2 times faster loading and reduces Time-To-First-Token (TTFT) during cold-start by 23–55% over state-of-the-art methods.
format	Preprint
id	arxiv_https___arxiv_org_abs_2512_01357
institution	arXiv
publishDate	2025
record_format	arxiv
title	Tangram: Accelerating Serverless LLM Loading through GPU Memory Reuse and Affinity
topic	Distributed, Parallel, and Cluster Computing; Artificial Intelligence; Hardware Architecture
url	https://arxiv.org/abs/2512.01357
