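Staff View: Vortex: Hosting ML Inference and Knowledge Retrieval Services With Tight Latency and Throughput Requirements :: Library Catalog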

Bibliographic Details
Main Authors:	Yang, Yuting; Yuan, Tiancheng; Hashim, Jamal; Garrett, Thiago; Qian, Jeffrey; Zhang, Ann; Wang, Yifan; Song, Weijia; Birman, Ken
Format:	Preprint
Published:	2025
Subjects:	Databases; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2511.02062

_version_	1866918184675704832
author	Yang, Yuting; Yuan, Tiancheng; Hashim, Jamal; Garrett, Thiago; Qian, Jeffrey; Zhang, Ann; Wang, Yifan; Song, Weijia; Birman, Ken
author_facet	Yang, Yuting; Yuan, Tiancheng; Hashim, Jamal; Garrett, Thiago; Qian, Jeffrey; Zhang, Ann; Wang, Yifan; Song, Weijia; Birman, Ken
contents	There is growing interest in deploying ML inference and knowledge retrieval as services that could support both interactive queries by end users and more demanding request flows that arise from AIs integrated into end-user applications and deployed as agents. Our central premise is that these latter cases will bring service level latency objectives (SLOs). Existing ML serving platforms use batching to optimize for high throughput, exposing them to unpredictable tail latencies. Vortex enables an SLO-first approach. For identical tasks, Vortex's pipelines achieve significantly lower and more stable latencies than TorchServe and Ray Serve over a wide range of workloads, often enabling a given SLO target at more than twice the request rate. When RDMA is available, the Vortex advantage is even more significant.
format	Preprint
id	arxiv_https___arxiv_org_abs_2511_02062
institution	arXiv
publishDate	2025
record_format	arxiv
title	Vortex: Hosting ML Inference and Knowledge Retrieval Services With Tight Latency and Throughput Requirements
topic	Databases; Artificial Intelligence
url	https://arxiv.org/abs/2511.02062

Similar Items