Bibliographic Details
Main Authors:	Huang, Tao; Chen, Pengfei; Gong, Kyoka; Hawk, Jocky; Bright, Zachary; Xie, Wenxin; Huang, Kecheng; Ji, Zhi
Format:	Preprint
Published:	2024
Subjects:	Distributed, Parallel, and Cluster Computing; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2407.09486

Table of Contents:

With the growing popularity of large language model (LLM) backend systems, it has become both common and necessary to deploy stable serverless LLM serving on multi-GPU clusters with autoscaling. However, this is challenging because the diversity and co-location of applications in multi-GPU clusters lead to low service quality and low GPU utilization. To address these challenges, we build ENOVA, a deployment, monitoring, and autoscaling service for serverless LLM serving. ENOVA comprehensively deconstructs the execution process of an LLM service and, on that basis, designs a configuration recommendation module for automatic deployment on any GPU cluster and a performance detection module for autoscaling. On top of these modules, ENOVA implements a deployment execution engine for multi-GPU cluster scheduling. The experimental results show that ENOVA significantly outperforms other state-of-the-art methods and is suitable for wide deployment in large online systems.
