Bibliographic Details
Main Authors: Yang, Yuting, Yuan, Tiancheng, Hashim, Jamal, Garrett, Thiago, Qian, Jeffrey, Zhang, Ann, Wang, Yifan, Song, Weijia, Birman, Ken
Format: Preprint
Published: 2025
Subjects: Databases, Artificial Intelligence
Online Access: https://arxiv.org/abs/2511.02062
_version_ 1866918184675704832
author Yang, Yuting
Yuan, Tiancheng
Hashim, Jamal
Garrett, Thiago
Qian, Jeffrey
Zhang, Ann
Wang, Yifan
Song, Weijia
Birman, Ken
author_facet Yang, Yuting
Yuan, Tiancheng
Hashim, Jamal
Garrett, Thiago
Qian, Jeffrey
Zhang, Ann
Wang, Yifan
Song, Weijia
Birman, Ken
contents There is growing interest in deploying ML inference and knowledge retrieval as services that could support both interactive queries by end users and more demanding request flows that arise from AIs integrated into end-user applications and deployed as agents. Our central premise is that these latter cases will bring service-level latency objectives (SLOs). Existing ML serving platforms use batching to optimize for high throughput, exposing them to unpredictable tail latencies. Vortex enables an SLO-first approach. For identical tasks, Vortex's pipelines achieve significantly lower and more stable latencies than TorchServe and Ray Serve over a wide range of workloads, often enabling a given SLO target at more than twice the request rate. When RDMA is available, the Vortex advantage is even more significant.
format Preprint
id arxiv_https___arxiv_org_abs_2511_02062
institution arXiv
publishDate 2025
record_format arxiv
spellingShingle Vortex: Hosting ML Inference and Knowledge Retrieval Services With Tight Latency and Throughput Requirements
Yang, Yuting
Yuan, Tiancheng
Hashim, Jamal
Garrett, Thiago
Qian, Jeffrey
Zhang, Ann
Wang, Yifan
Song, Weijia
Birman, Ken
Databases
Artificial Intelligence
There is growing interest in deploying ML inference and knowledge retrieval as services that could support both interactive queries by end users and more demanding request flows that arise from AIs integrated into end-user applications and deployed as agents. Our central premise is that these latter cases will bring service-level latency objectives (SLOs). Existing ML serving platforms use batching to optimize for high throughput, exposing them to unpredictable tail latencies. Vortex enables an SLO-first approach. For identical tasks, Vortex's pipelines achieve significantly lower and more stable latencies than TorchServe and Ray Serve over a wide range of workloads, often enabling a given SLO target at more than twice the request rate. When RDMA is available, the Vortex advantage is even more significant.
title Vortex: Hosting ML Inference and Knowledge Retrieval Services With Tight Latency and Throughput Requirements
topic Databases
Artificial Intelligence
url https://arxiv.org/abs/2511.02062