Staff View: PRISM: Fast Online LLM Serving via Scheduling-Memory Co-design :: Library Catalog

Bibliographic Details
Main Authors:	Qu, Xingyu; Lin, Tianhao; Li, Yiqi; Chen, Zhiyu; Wang, Sheng
Format:	Preprint
Published:	2026
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2605.08581

_version_	1866915996084731904
author	Qu, Xingyu; Lin, Tianhao; Li, Yiqi; Chen, Zhiyu; Wang, Sheng
author_facet	Qu, Xingyu; Lin, Tianhao; Li, Yiqi; Chen, Zhiyu; Wang, Sheng
contents	Modern online large language model (LLM) services, such as Retrieval-Augmented Generation (RAG) and agent systems, increasingly exhibit two prominent characteristics: prompt segmentation (e.g., system instructions, retrieved passages, tool outputs) and hotspot skew, where a small set of these segments recurs frequently across user requests. Failing to exploit these patterns jointly can lead to repeated prefill of hot segments and prolonged time to first token (TTFT), undermining both throughput and user-perceived responsiveness. However, existing work tackles these patterns independently: KV-cache management mainly exploits segment reuse, while scheduling reorders requests to improve cache locality, yet neither aligns request admission with KV-cache retention. To address this gap, we first analyze how scheduling and KV-cache management jointly affect TTFT. Guided by this analysis, we present PRISM (Prefix Reuse Optimization Integrated Scheduling and Memory), which co-designs a query-aware scheduler (QAS) with a demand-aware radix tree (DART) to align request admission with exact-prefix KV retention. Our evaluation results show that, versus the strongest baseline, PRISM reduces average per-QPS P99 TTFT by 23.3% and 37.1% while increasing exact-prefix KV-cache hit rate by 5.9 and 12.2 percentage points on 4B and 13B models, respectively.
format	Preprint
id	arxiv_https___arxiv_org_abs_2605_08581
institution	arXiv
publishDate	2026
record_format	arxiv
title	PRISM: Fast Online LLM Serving via Scheduling-Memory Co-design
topic	Machine Learning
url	https://arxiv.org/abs/2605.08581

Similar Items