Bibliographic Details
Main Authors: Zhu, Wenbin; Shen, Zhaoyan; Shao, Zili; Dai, Hongjun; Chen, Feng
Format: Preprint
Published: 2025
Subjects:
Online Access: https://arxiv.org/abs/2512.01357
_version_ 1866911296281116672
author Zhu, Wenbin
Shen, Zhaoyan
Shao, Zili
Dai, Hongjun
Chen, Feng
contents Serverless Large Language Models (LLMs) have emerged as a cost-effective way to deploy AI services, enabling a 'pay-as-you-go' pricing model through GPU resource sharing. However, cold-start latency, especially in the model loading phase, has become a critical performance bottleneck: it scales linearly with model size and severely limits the practical deployment of large-scale LLM services. This paper presents Tangram, a system that accelerates Serverless LLM loading through efficient GPU memory reuse. By using otherwise unused GPU memory to retain model parameters, Tangram significantly reduces model transfer time and cold-start latency. Its design has three key components: a unified GPU memory pool for tensor-level parameter sharing across models, on-demand KV cache allocation for dynamic memory management, and GPU-affinity-aware scheduling for maximizing resource utilization. Together, these techniques address inefficient memory usage and the cold-start problem in Serverless LLM platforms. We have implemented a fully functional prototype, and experiments show that Tangram achieves up to 6.2 times faster loading and reduces Time-To-First-Token (TTFT) during cold start by 23-55% compared with state-of-the-art methods.
format Preprint
id arxiv_https___arxiv_org_abs_2512_01357
institution arXiv
publishDate 2025
record_format arxiv
title Tangram: Accelerating Serverless LLM Loading through GPU Memory Reuse and Affinity
topic Distributed, Parallel, and Cluster Computing
Artificial Intelligence
Hardware Architecture
url https://arxiv.org/abs/2512.01357
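
The abstract above names tensor-level parameter reuse and GPU-affinity-aware scheduling but gives no implementation detail. The following is only a minimal sketch of the general idea, not Tangram's actual design: it assumes each model is a mapping from a tensor key to its size in bytes, and that a scheduler prefers the GPU on which the fewest bytes still need to be transferred. The names `Gpu`, `bytes_to_load`, `pick_gpu`, and `load` are hypothetical, and eviction, memory capacity, and KV cache allocation are deliberately left out.

```python
from dataclasses import dataclass, field


@dataclass
class Gpu:
    """Hypothetical view of one GPU: tensors still resident in its memory."""
    name: str
    resident: dict[str, int] = field(default_factory=dict)  # tensor key -> bytes


def bytes_to_load(model: dict[str, int], gpu: Gpu) -> int:
    """Bytes that must still be transferred to place `model` on `gpu`."""
    return sum(size for key, size in model.items() if key not in gpu.resident)


def pick_gpu(model: dict[str, int], gpus: list[Gpu]) -> Gpu:
    """Affinity heuristic (an assumption): pick the GPU needing the least new data."""
    return min(gpus, key=lambda g: bytes_to_load(model, g))


def load(model: dict[str, int], gpu: Gpu) -> int:
    """Mark all of the model's tensors as resident; return bytes transferred."""
    missing = bytes_to_load(model, gpu)
    gpu.resident.update(model)
    return missing


# Toy data: gpu0 already holds two of the model's three tensors.
gpus = [
    Gpu("gpu0", {"embed": 400, "layer0": 100}),
    Gpu("gpu1", {"layer0": 100}),
]
model = {"embed": 400, "layer0": 100, "head": 50}

chosen = pick_gpu(model, gpus)
transferred = load(model, chosen)
print(chosen.name, transferred)
```

In this toy case only the `head` tensor has to be moved, because the other two are already resident on the chosen GPU. A real system would also have to decide what to evict when memory runs short, which this sketch does not attempt.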