Staff View: :: Library Catalog

Saved in:

Bibliographic Details
Main Authors:	Yang, Yuanyuan, Zhang, Ruimin, Morgenstern, Jamie, Xu, Haifeng
Format:	Preprint
Published:	2025
Subjects:	Machine Learning Computer Science and Game Theory
Online Access:	https://arxiv.org/abs/2509.22992
Tags:	Add Tag No Tags, Be the first to tag this record!

_version_	1866912609733705728
author	Yang, Yuanyuan Zhang, Ruimin Morgenstern, Jamie Xu, Haifeng
author_facet	Yang, Yuanyuan Zhang, Ruimin Morgenstern, Jamie Xu, Haifeng
contents	As machine learning models continue to grow in size and complexity, efficient serving faces increasingly broad trade-offs spanning accuracy, latency, resource usage, and other objectives. Multi-model serving further complicates these trade-offs; for example, in cascaded models, each early-exit decision balances latency reduction against potential accuracy loss. Despite the pervasiveness and importance of such trade-offs, current strategies remain largely heuristic and case-specific, limiting both their theoretical guarantees and general applicability. We present a general framework, T-Tamer, which formalizes this setting as a multi-stage decision process, where the objective is to determine both when to exit and which model to consult. Our main result shows that recall (i.e., the ability to revisit earlier models) is both necessary and sufficient for achieving provable performance guarantees. In particular, we prove that strategies without recall cannot obtain any constant-factor approximation to the optimal trade-off, whereas recall-based strategies provably attain the optimal trade-off in polynomial time. We validate our analysis through experiments on synthetic datasets and early-exit workloads for vision and NLP benchmarks. The results show that recall-based strategies consistently yield efficient accuracy-latency trade-offs. We hope this work provides a principled foundation for bridging heuristic practice with theoretical guarantees in the design of early-exit and cascaded models.
format	Preprint
id	arxiv_https___arxiv_org_abs_2509_22992
institution	arXiv
publishDate	2025
record_format	arxiv
spellingShingle	T-TAMER: Provably Taming Trade-offs in ML Serving Yang, Yuanyuan Zhang, Ruimin Morgenstern, Jamie Xu, Haifeng Machine Learning Computer Science and Game Theory As machine learning models continue to grow in size and complexity, efficient serving faces increasingly broad trade-offs spanning accuracy, latency, resource usage, and other objectives. Multi-model serving further complicates these trade-offs; for example, in cascaded models, each early-exit decision balances latency reduction against potential accuracy loss. Despite the pervasiveness and importance of such trade-offs, current strategies remain largely heuristic and case-specific, limiting both their theoretical guarantees and general applicability. We present a general framework, T-Tamer, which formalizes this setting as a multi-stage decision process, where the objective is to determine both when to exit and which model to consult. Our main result shows that recall (i.e., the ability to revisit earlier models) is both necessary and sufficient for achieving provable performance guarantees. In particular, we prove that strategies without recall cannot obtain any constant-factor approximation to the optimal trade-off, whereas recall-based strategies provably attain the optimal trade-off in polynomial time. We validate our analysis through experiments on synthetic datasets and early-exit workloads for vision and NLP benchmarks. The results show that recall-based strategies consistently yield efficient accuracy-latency trade-offs. We hope this work provides a principled foundation for bridging heuristic practice with theoretical guarantees in the design of early-exit and cascaded models.
title	T-TAMER: Provably Taming Trade-offs in ML Serving
topic	Machine Learning Computer Science and Game Theory
url	https://arxiv.org/abs/2509.22992

Similar Items