| Main Authors: | Song, Mingcong, Tang, Xinru, Hou, Fengfan, Li, Jing, Wei, Wei, Ma, Yipeng, Xiao, Runqiu, Si, Hongjie, Jiang, Dingcheng, Yin, Shouyi, Hu, Yang, Long, Guoping |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Online Access: | https://arxiv.org/abs/2412.18106 |
Similar Items
DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving
by: Zhong, Yinmin, et al.
Published: (2024)
Prefill-Decode Aggregation or Disaggregation? Unifying Both for Goodput-Optimized LLM Serving
by: Wang, Chao, et al.
Published: (2025)
MoEntwine: Unleashing the Potential of Wafer-scale Chips for Large-scale Expert Parallel Inference
by: Tang, Xinru, et al.
Published: (2025)
Disaggregated Prefill and Decoding Inference System for Large Language Model Serving on Multi-Vendor GPUs
by: Chen, Xing, et al.
Published: (2025)
FlowPrefill: Decoupling Preemption from Prefill Scheduling Granularity to Mitigate Head-of-Line Blocking in LLM Serving
by: Hsieh, Chia-chi, et al.
Published: (2026)
Nexus: Proactive Intra-GPU Disaggregation of Prefill and Decode in LLM Serving
by: Shi, Xiaoxiang, et al.
Published: (2025)
From Tokens to Layers: Redefining Stall-Free Scheduling for MoE Serving with Layered Prefill
by: Lee, Gunjun, et al.
Published: (2025)
PrefillShare: A Shared Prefill Module for KV Reuse in Multi-LLM Disaggregated Serving
by: Woo, Sunghyeon, et al.
Published: (2026)
DUET: Disaggregated Hybrid Mamba-Transformer LLMs with Prefill and Decode-Specific Packages
by: Kanani, Alish, et al.
Published: (2026)
LAPS: A Length-Aware-Prefill LLM Serving System
by: She, Jianshu, et al.
Published: (2026)
A Predictive and Synergistic Two-Layer Scheduling Framework for LLM Serving
by: Zhang, Yue, et al.
Published: (2025)
HydraInfer: Hybrid Disaggregated Scheduling for Multimodal Large Language Model Serving
by: Dong, Xianzhe, et al.
Published: (2025)
PrefillOnly: An Inference Engine for Prefill-only Workloads in Large Language Model Applications
by: Du, Kuntai, et al.
Published: (2025)
Block: Balancing Load in LLM Serving with Context, Knowledge and Predictive Scheduling
by: Da, Wei, et al.
Published: (2025)
Tackling the Data-Parallel Load Balancing Bottleneck in LLM Serving: Practical Online Routing at Scale
by: Bu, Tianci, et al.
Published: (2026)
Cortex: Workflow-Aware Resource Pooling and Scheduling for Agentic Serving
by: Pagonas, Nikos, et al.
Published: (2025)
Past-Future Scheduler for LLM Serving under SLA Guarantees
by: Gong, Ruihao, et al.
Published: (2025)
PROSERVE: Unified Multi-Priority Request Scheduling for LLM Serving
by: Huang, Weizhe, et al.
Published: (2025)
RServe: Overlapping Encoding and Prefill for Efficient LMM Inference
by: Guo, Tianyu, et al.
Published: (2025)
Slice-Level Scheduling for High Throughput and Load Balanced LLM Serving
by: Cheng, Ke, et al.
Published: (2024)
PHWSOA: A Pareto-based Hybrid Whale-Seagull Scheduling for Multi-Objective Tasks in Cloud Computing
by: Zhao, Zhi, et al.
Published: (2025)
FREESH: Fair, Resource- and Energy-Efficient Scheduling for LLM Serving on Heterogeneous GPUs
by: He, Xuan, et al.
Published: (2025)
SneakPeek: Data-Aware Model Selection and Scheduling for Inference Serving on the Edge
by: Wolfrath, Joel, et al.
Published: (2025)
Harpagon: Minimizing DNN Serving Cost via Efficient Dispatching, Scheduling and Splitting
by: Zhao, Zhixin, et al.
Published: (2024)
Equinox: Holistic Fair Scheduling in Serving Large Language Models
by: Wei, Zhixiang, et al.
Published: (2025)
Large-Scale LLM Inference with Heterogeneous Workloads: Prefill-Decode Contention and Asymptotically Optimal Control
by: Lin, Ruihan, et al.
Published: (2026)
HexAGenT: Efficient Agentic LLM Serving via Workflow- and Heterogeneity-Aware Scheduling
by: Peng, You, et al.
Published: (2026)
CascadeInfer: Length-Aware Scheduling of LLM Serving with Low Latency and Load Balancing
by: Yuan, Yitao, et al.
Published: (2025)
EcoServe: Enabling Cost-effective LLM Serving with Proactive Intra- and Inter-Instance Orchestration
by: Du, Jiangsu, et al.
Published: (2025)
FASER: Fine-Grained Phase Management for Speculative Decoding in Dynamic LLM Serving
by: Chen, Wenyan, et al.
Published: (2026)
SPAD: Specialized Prefill and Decode Hardware for Disaggregated LLM Inference
by: Zhang, Hengrui, et al.
Published: (2025)
CaraServe: CPU-Assisted and Rank-Aware LoRA Serving for Generative LLM Inference
by: Li, Suyi, et al.
Published: (2024)
Locality-aware Fair Scheduling in LLM Serving
by: Cao, Shiyi, et al.
Published: (2025)
MixServe: An Automatic Distributed Serving System for MoE Models with Hybrid Parallelism Based on Fused Communication Algorithm
by: Zhou, Bowen, et al.
Published: (2026)
Serving Chain-structured Jobs with Large Memory Footprints with Application to Large Foundation Model Serving
by: Sun, Tingyang, et al.
Published: (2026)
Prefill-as-a-Service: KVCache of Next-Generation Models Could Go Cross-Datacenter
by: Qin, Ruoyu, et al.
Published: (2026)
Cronus: Efficient LLM inference on Heterogeneous GPU Clusters via Partially Disaggregated Prefill
by: Liu, Yunzhao, et al.
Published: (2025)
Llumnix: Dynamic Scheduling for Large Language Model Serving
by: Sun, Biao, et al.
Published: (2024)
Preble: Efficient Distributed Prompt Scheduling for LLM Serving
by: Srivatsa, Vikranth, et al.
Published: (2024)
In Serverless, OS Scheduler Choice Costs Money: A Hybrid Scheduling Approach for Cheaper FaaS
by: Zhao, Yuxuan, et al.
Published: (2024)