Staff View: Compass: SLO-aware Query Planner for Compound AI Serving at Scale :: Library Catalog

Bibliographic Details
Main Authors:	Liu, Banruo; Lin, Wei-Yu; Fang, Minghao; Jiang, Yihan; Lai, Fan
Format:	Preprint
Published:	2025
Subjects:	Databases; Machine Learning
Online Access:	https://arxiv.org/abs/2504.16397

_version_	1866914572985696256
author	Liu, Banruo; Lin, Wei-Yu; Fang, Minghao; Jiang, Yihan; Lai, Fan
author_facet	Liu, Banruo; Lin, Wei-Yu; Fang, Minghao; Jiang, Yihan; Lai, Fan
contents	The rise of compound AI serving that integrates multiple operators in a pipeline enables end-user applications such as generative AI-powered meeting companions, autonomous driving, and immersive gaming. These workloads span diverse deployment spaces, from cloud-only queries to edge-assisted ones across infrastructure tiers, often including both within an application. Achieving high service goodput -- i.e., meeting service level objectives (SLOs) for pipeline latency, accuracy, and costs -- requires joint planning of operators' placement, configuration, and resource allocation. However, diverse SLOs, varying runtime environments (e.g., heterogeneous device speeds), and a large volume of queries competing for shared infrastructure explode the planning space, making real-time serving and cost-efficient deployment intractable with existing advances. This paper presents Compass, the first SLO-aware query planner that optimizes large-scale compound AI workloads across diverse deployment spaces. Compass decomposes the many-query, multi-SLO planning problem into tractable subproblems while preserving global decision quality, exploiting plan similarities within and across queries to slash the search steps. It further improves per-step efficiency with a plan profiler that performs selective profiling to achieve high-fidelity performance estimates at a fraction of the profiling cost. At runtime, Compass performs query-plan bipartite matching to maximize SLO goodput under resource contentions. Real-world evaluations show that Compass improves service goodput by 2.4--5.1x, reduces deployment costs by 3.8--4.5x, and accelerates planning by 4.2--10.5x, achieving service responsiveness within seconds and near-optimal decision quality.
format	Preprint
id	arxiv_https___arxiv_org_abs_2504_16397
institution	arXiv
publishDate	2025
record_format	arxiv
title	Compass: SLO-aware Query Planner for Compound AI Serving at Scale
topic	Databases; Machine Learning
url	https://arxiv.org/abs/2504.16397

Similar Items