| Main Authors: | Yu, Lingfan, Lin, Jinkun, Li, Jinyang |
|---|---|
| Format: | Preprint |
| Published: | 2023 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2312.05516 |
Similar Items
CRAFT: Fine-Grained Cost-Aware Expert Replication For Efficient Mixture-of-Experts Serving
by: Zhao, Adrian, et al.
Published: (2026)
Mélange: Cost Efficient Large Language Model Serving by Exploiting GPU Heterogeneity
by: Griggs, Tyler, et al.
Published: (2024)
Understanding Stragglers in Large Model Training Using What-if Analysis
by: Lin, Jinkun, et al.
Published: (2025)
Towards Sustainable Large Language Model Serving
by: Nguyen, Sophia, et al.
Published: (2024)
Fast Distributed Inference Serving for Large Language Models
by: Wu, Bingyang, et al.
Published: (2023)
P/D-Serve: Serving Disaggregated Large Language Model at Scale
by: Jin, Yibo, et al.
Published: (2024)
LoongServe: Efficiently Serving Long-Context Large Language Models with Elastic Sequence Parallelism
by: Wu, Bingyang, et al.
Published: (2024)
Llumnix: Dynamic Scheduling for Large Language Model Serving
by: Sun, Biao, et al.
Published: (2024)
FastSwitch: Optimizing Context Switching Efficiency in Fairness-aware Large Language Model Serving
by: Shen, Ao, et al.
Published: (2024)
Towards Resiliency in Large Language Model Serving with KevlarFlow
by: Qian, Shangshu, et al.
Published: (2026)
Enabling Efficient Serverless Inference Serving for LLM (Large Language Model) in the Cloud
by: Ghosh, Himel
Published: (2024)
DSD: A Distributed Speculative Decoding Solution for Edge-Cloud Agile Large Model Serving
by: Yu, Fengze, et al.
Published: (2025)
A Universal Load Balancing Principle and Its Application to Large Language Model Serving
by: Chen, Zixi, et al.
Published: (2026)
CascadeServe: Unlocking Model Cascades for Inference Serving
by: Kossmann, Ferdi, et al.
Published: (2024)
TetriServe: Efficient DiT Serving for Heterogeneous Image Generation
by: Lu, Runyu, et al.
Published: (2025)
SiDA-MoE: Sparsity-Inspired Data-Aware Serving for Efficient and Scalable Large Mixture-of-Experts Models
by: Du, Zhixu, et al.
Published: (2023)
PolyServe: Efficient Multi-SLO Serving at Scale
by: Zhu, Kan, et al.
Published: (2025)
SLOs-Serve: Optimized Serving of Multi-SLO LLMs
by: Chen, Siyuan, et al.
Published: (2025)
Symphony: Optimized DNN Model Serving using Deferred Batch Scheduling
by: Chen, Lequn, et al.
Published: (2023)
ConServe: Fine-Grained GPU Harvesting for LLM Online and Offline Co-Serving
by: Qiao, Yifan, et al.
Published: (2024)
OMEGA: A Low-Latency GNN Serving System for Large Graphs
by: Kim, Geon-Woo, et al.
Published: (2025)
DataStates-LLM: Lazy Asynchronous Checkpointing for Large Language Models
by: Maurya, Avinash, et al.
Published: (2024)
MUSE: Multi-Tenant Model Serving With Seamless Model Updates
by: Correia, Cláudio, et al.
Published: (2026)
EdgeServe: A Streaming System for Decentralized Model Serving
by: Shaowang, Ted, et al.
Published: (2023)
Echo: Efficient Co-Scheduling of Hybrid Online-Offline Tasks for Large Language Model Serving
by: Wang, Zhibin, et al.
Published: (2025)
Helix: Serving Large Language Models over Heterogeneous GPUs and Network via Max-Flow
by: Mei, Yixuan, et al.
Published: (2024)
WarmServe: Enabling One-for-Many GPU Prewarming for Multi-LLM Serving
by: Lou, Chiheng, et al.
Published: (2025)
EPIC: Efficient Position-Independent Caching for Serving Large Language Models
by: Hu, Junhao, et al.
Published: (2024)
Cornfigurator: Automated Planning for Any-to-Any Multimodal Model Serving
by: Ma, Jeff J., et al.
Published: (2025)
SpecInfer: Accelerating Generative Large Language Model Serving with Tree-based Speculative Inference and Verification
by: Miao, Xupeng, et al.
Published: (2023)
Serving Large Language Models on Huawei CloudMatrix384
by: Zuo, Pengfei, et al.
Published: (2025)
Preble: Efficient Distributed Prompt Scheduling for LLM Serving
by: Srivatsa, Vikranth, et al.
Published: (2024)
VoltanaLLM: Feedback-Driven Frequency Control and State-Space Routing for Energy-Efficient LLM Serving
by: Yu, Jiahuan, et al.
Published: (2025)
MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism
by: Zhu, Ruidong, et al.
Published: (2025)
Locality-aware Fair Scheduling in LLM Serving
by: Cao, Shiyi, et al.
Published: (2025)
DeltaZip: Efficient Serving of Multiple Full-Model-Tuned LLMs
by: Yao, Xiaozhe, et al.
Published: (2023)
Cornserve: A Distributed Serving System for Any-to-Any Multimodal Models
by: Chung, Jae-Won, et al.
Published: (2026)
Characterization of Large Language Model Development in the Datacenter
by: Hu, Qinghao, et al.
Published: (2024)
FineServe: Precision-Aware KV Slab and Two-Level Scheduling for Heterogeneous Precision LLM Serving
by: Bin, Kyungmin, et al.
Published: (2025)
Efficient Heterogeneous Large Language Model Decoding with Model-Attention Disaggregation
by: Chen, Shaoyuan, et al.
Published: (2024)