Bibliographic Details
Main Authors:	Song, Mingcong; Tang, Xinru; Hou, Fengfan; Li, Jing; Wei, Wei; Ma, Yipeng; Xiao, Runqiu; Si, Hongjie; Jiang, Dingcheng; Yin, Shouyi; Hu, Yang; Long, Guoping
Format:	Preprint
Published:	2024
Subjects:	Artificial Intelligence; Distributed, Parallel, and Cluster Computing; Machine Learning
Online Access:	https://arxiv.org/abs/2412.18106

Similar Items

DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving
by: Zhong, Yinmin, et al.
Published: (2024)

Prefill-Decode Aggregation or Disaggregation? Unifying Both for Goodput-Optimized LLM Serving
by: Wang, Chao, et al.
Published: (2025)

MoEntwine: Unleashing the Potential of Wafer-scale Chips for Large-scale Expert Parallel Inference
by: Tang, Xinru, et al.
Published: (2025)

Disaggregated Prefill and Decoding Inference System for Large Language Model Serving on Multi-Vendor GPUs
by: Chen, Xing, et al.
Published: (2025)

FlowPrefill: Decoupling Preemption from Prefill Scheduling Granularity to Mitigate Head-of-Line Blocking in LLM Serving
by: Hsieh, Chia-chi, et al.
Published: (2026)

Nexus: Proactive Intra-GPU Disaggregation of Prefill and Decode in LLM Serving
by: Shi, Xiaoxiang, et al.
Published: (2025)

From Tokens to Layers: Redefining Stall-Free Scheduling for MoE Serving with Layered Prefill
by: Lee, Gunjun, et al.
Published: (2025)

PrefillShare: A Shared Prefill Module for KV Reuse in Multi-LLM Disaggregated Serving
by: Woo, Sunghyeon, et al.
Published: (2026)

DUET: Disaggregated Hybrid Mamba-Transformer LLMs with Prefill and Decode-Specific Packages
by: Kanani, Alish, et al.
Published: (2026)

LAPS: A Length-Aware-Prefill LLM Serving System
by: She, Jianshu, et al.
Published: (2026)

A Predictive and Synergistic Two-Layer Scheduling Framework for LLM Serving
by: Zhang, Yue, et al.
Published: (2025)

HydraInfer: Hybrid Disaggregated Scheduling for Multimodal Large Language Model Serving
by: Dong, Xianzhe, et al.
Published: (2025)

PrefillOnly: An Inference Engine for Prefill-only Workloads in Large Language Model Applications
by: Du, Kuntai, et al.
Published: (2025)

Block: Balancing Load in LLM Serving with Context, Knowledge and Predictive Scheduling
by: Da, Wei, et al.
Published: (2025)

Tackling the Data-Parallel Load Balancing Bottleneck in LLM Serving: Practical Online Routing at Scale
by: Bu, Tianci, et al.
Published: (2026)

Cortex: Workflow-Aware Resource Pooling and Scheduling for Agentic Serving
by: Pagonas, Nikos, et al.
Published: (2025)

Past-Future Scheduler for LLM Serving under SLA Guarantees
by: Gong, Ruihao, et al.
Published: (2025)

PROSERVE: Unified Multi-Priority Request Scheduling for LLM Serving
by: Huang, Weizhe, et al.
Published: (2025)

RServe: Overlapping Encoding and Prefill for Efficient LMM Inference
by: Guo, Tianyu, et al.
Published: (2025)

Slice-Level Scheduling for High Throughput and Load Balanced LLM Serving
by: Cheng, Ke, et al.
Published: (2024)

PHWSOA: A Pareto-based Hybrid Whale-Seagull Scheduling for Multi-Objective Tasks in Cloud Computing
by: Zhao, Zhi, et al.
Published: (2025)

FREESH: Fair, Resource- and Energy-Efficient Scheduling for LLM Serving on Heterogeneous GPUs
by: He, Xuan, et al.
Published: (2025)

SneakPeek: Data-Aware Model Selection and Scheduling for Inference Serving on the Edge
by: Wolfrath, Joel, et al.
Published: (2025)

Harpagon: Minimizing DNN Serving Cost via Efficient Dispatching, Scheduling and Splitting
by: Zhao, Zhixin, et al.
Published: (2024)

Equinox: Holistic Fair Scheduling in Serving Large Language Models
by: Wei, Zhixiang, et al.
Published: (2025)

Large-Scale LLM Inference with Heterogeneous Workloads: Prefill-Decode Contention and Asymptotically Optimal Control
by: Lin, Ruihan, et al.
Published: (2026)

HexAGenT: Efficient Agentic LLM Serving via Workflow- and Heterogeneity-Aware Scheduling
by: Peng, You, et al.
Published: (2026)

CascadeInfer: Length-Aware Scheduling of LLM Serving with Low Latency and Load Balancing
by: Yuan, Yitao, et al.
Published: (2025)

EcoServe: Enabling Cost-effective LLM Serving with Proactive Intra- and Inter-Instance Orchestration
by: Du, Jiangsu, et al.
Published: (2025)

FASER: Fine-Grained Phase Management for Speculative Decoding in Dynamic LLM Serving
by: Chen, Wenyan, et al.
Published: (2026)

SPAD: Specialized Prefill and Decode Hardware for Disaggregated LLM Inference
by: Zhang, Hengrui, et al.
Published: (2025)

CaraServe: CPU-Assisted and Rank-Aware LoRA Serving for Generative LLM Inference
by: Li, Suyi, et al.
Published: (2024)

Locality-aware Fair Scheduling in LLM Serving
by: Cao, Shiyi, et al.
Published: (2025)

MixServe: An Automatic Distributed Serving System for MoE Models with Hybrid Parallelism Based on Fused Communication Algorithm
by: Zhou, Bowen, et al.
Published: (2026)

Serving Chain-structured Jobs with Large Memory Footprints with Application to Large Foundation Model Serving
by: Sun, Tingyang, et al.
Published: (2026)

Prefill-as-a-Service: KVCache of Next-Generation Models Could Go Cross-Datacenter
by: Qin, Ruoyu, et al.
Published: (2026)

Cronus: Efficient LLM inference on Heterogeneous GPU Clusters via Partially Disaggregated Prefill
by: Liu, Yunzhao, et al.
Published: (2025)

Llumnix: Dynamic Scheduling for Large Language Model Serving
by: Sun, Biao, et al.
Published: (2024)

Preble: Efficient Distributed Prompt Scheduling for LLM Serving
by: Srivatsa, Vikranth, et al.
Published: (2024)

In Serverless, OS Scheduler Choice Costs Money: A Hybrid Scheduling Approach for Cheaper FaaS
by: Zhao, Yuxuan, et al.
Published: (2024)