Saved in:
Bibliographic Details
Main Authors: Liu, Banruo, Lin, Wei-Yu, Fang, Minghao, Jiang, Yihan, Lai, Fan
Format: Preprint
Published: 2025
Subjects:
Online Access:https://arxiv.org/abs/2504.16397
Tags: Add Tag
No Tags, Be the first to tag this record!
_version_ 1866914572985696256
author Liu, Banruo
Lin, Wei-Yu
Fang, Minghao
Jiang, Yihan
Lai, Fan
author_facet Liu, Banruo
Lin, Wei-Yu
Fang, Minghao
Jiang, Yihan
Lai, Fan
contents The rise of compound AI serving that integrates multiple operators in a pipeline enables end-user applications such as generative AI-powered meeting companions, autonomous driving, and immersive gaming. These workloads span diverse deployment spaces, from cloud-only queries to edge-assisted ones across infrastructure tiers, often including both within an application. Achieving high service goodput -- i.e., meeting service level objectives (SLOs) for pipeline latency, accuracy, and costs -- requires joint planning of operators' placement, configuration, and resource allocation. However, diverse SLOs, varying runtime environments (e.g., heterogeneous device speeds), and a large volume of queries competing for shared infrastructure explode the planning space, making real-time serving and cost-efficient deployment intractable with existing advances. This paper presents Compass, the first SLO-aware query planner that optimizes large-scale compound AI workloads across diverse deployment spaces. Compass decomposes the many-query, multi-SLO planning problem into tractable subproblems while preserving global decision quality, exploiting plan similarities within and across queries to slash the search steps. It further improves per-step efficiency with a plan profiler that performs selective profiling to achieve high-fidelity performance estimates at a fraction of the profiling cost. At runtime, Compass performs query-plan bipartite matching to maximize SLO goodput under resource contentions. Real-world evaluations show that Compass improves service goodput by 2.4--5.1x, reduces deployment costs by 3.8--4.5x, and accelerates planning by 4.2--10.5x, achieving service responsiveness within seconds and near-optimal decision quality.
format Preprint
id arxiv_https___arxiv_org_abs_2504_16397
institution arXiv
publishDate 2025
record_format arxiv
spellingShingle Compass: SLO-aware Query Planner for Compound AI Serving at Scale
Liu, Banruo
Lin, Wei-Yu
Fang, Minghao
Jiang, Yihan
Lai, Fan
Databases
Machine Learning
The rise of compound AI serving that integrates multiple operators in a pipeline enables end-user applications such as generative AI-powered meeting companions, autonomous driving, and immersive gaming. These workloads span diverse deployment spaces, from cloud-only queries to edge-assisted ones across infrastructure tiers, often including both within an application. Achieving high service goodput -- i.e., meeting service level objectives (SLOs) for pipeline latency, accuracy, and costs -- requires joint planning of operators' placement, configuration, and resource allocation. However, diverse SLOs, varying runtime environments (e.g., heterogeneous device speeds), and a large volume of queries competing for shared infrastructure explode the planning space, making real-time serving and cost-efficient deployment intractable with existing advances. This paper presents Compass, the first SLO-aware query planner that optimizes large-scale compound AI workloads across diverse deployment spaces. Compass decomposes the many-query, multi-SLO planning problem into tractable subproblems while preserving global decision quality, exploiting plan similarities within and across queries to slash the search steps. It further improves per-step efficiency with a plan profiler that performs selective profiling to achieve high-fidelity performance estimates at a fraction of the profiling cost. At runtime, Compass performs query-plan bipartite matching to maximize SLO goodput under resource contentions. Real-world evaluations show that Compass improves service goodput by 2.4--5.1x, reduces deployment costs by 3.8--4.5x, and accelerates planning by 4.2--10.5x, achieving service responsiveness within seconds and near-optimal decision quality.
title Compass: SLO-aware Query Planner for Compound AI Serving at Scale
topic Databases
Machine Learning
url https://arxiv.org/abs/2504.16397