:: Library Catalog

Bibliographic Details
Main Authors:	Huang, Tao; Chen, Pengfei; Gong, Kyoka; Hawk, Jocky; Bright, Zachary; Xie, Wenxin; Huang, Kecheng; Ji, Zhi
Format:	Preprint
Published:	2024
Subjects:	Distributed, Parallel, and Cluster Computing; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2407.09486

Similar Items

TokenScale: Timely and Accurate Autoscaling for Disaggregated LLM Serving with Token Velocity
by: Lai, Ruiqi, et al.
Published: (2025)

DeepServe: Serverless Large Language Model Serving at Scale
by: Hu, Junhao, et al.
Published: (2025)

The High Cost of Keeping Warm: Characterizing Overhead in Serverless Autoscaling Policies
by: Kondrashov, Leonid, et al.
Published: (2025)

HydraServe: Minimizing Cold Start Latency for Serverless LLM Serving in Public Clouds
by: Lou, Chiheng, et al.
Published: (2025)

CASA: A Framework for SLO and Carbon-Aware Autoscaling and Scheduling in Serverless Cloud Computing
by: Qi, S., et al.
Published: (2024)

Harpagon: Minimizing DNN Serving Cost via Efficient Dispatching, Scheduling and Splitting
by: Zhao, Zhixin, et al.
Published: (2024)

Hierarchical Autoscaling for Large Language Model Serving with Chiron
by: Patke, Archit, et al.
Published: (2025)

Dilu: Enabling GPU Resourcing-on-Demand for Serverless DL Serving via Introspective Elasticity
by: Lv, Cunchi, et al.
Published: (2025)

MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool
by: Hu, Cunchen, et al.
Published: (2024)

SpotKube: Cost-Optimal Microservices Deployment with Cluster Autoscaling and Spot Pricing
by: Edirisinghe, Dasith, et al.
Published: (2024)

FlexPipe: Adapting Dynamic LLM Serving Through Inflight Pipeline Refactoring in Fragmented Serverless Clusters
by: Lin, Yanying, et al.
Published: (2025)

SparseServe: Unlocking Parallelism for Dynamic Sparse Attention in Long-Context LLM Serving
by: Zhou, Qihui, et al.
Published: (2025)

ThunderServe: High-performance and Cost-efficient LLM Serving in Cloud Environments
by: Jiang, Youhe, et al.
Published: (2025)

PROSERVE: Unified Multi-Priority Request Scheduling for LLM Serving
by: Huang, Weizhe, et al.
Published: (2025)

An SLO Driven and Cost-Aware Autoscaling Framework for Kubernetes
by: Punniyamoorthy, Vinoth, et al.
Published: (2025)

Taming the Chaos: Coordinated Autoscaling for Heterogeneous and Disaggregated LLM Inference
by: Li, Rongzhi, et al.
Published: (2025)

Enabling Efficient Serverless Inference Serving for LLM (Large Language Model) in the Cloud
by: Ghosh, Himel
Published: (2024)

ServerlessLLM: Low-Latency Serverless Inference for Large Language Models
by: Fu, Yao, et al.
Published: (2024)

EcoServe: Enabling Cost-effective LLM Serving with Proactive Intra- and Inter-Instance Orchestration
by: Du, Jiangsu, et al.
Published: (2025)

Demystifying Cost-Efficiency in LLM Serving over Heterogeneous GPUs
by: Jiang, Youhe, et al.
Published: (2025)

It Takes Two to Tango: Serverless Workflow Serving via Bilaterally Engaged Resource Adaptation
by: Wu, Jing, et al.
Published: (2025)

Self-adaptive, Requirements-driven Autoscaling of Microservices
by: Nunes, João Paulo Karol Santos, et al.
Published: (2024)

Proactive and Reactive Autoscaling Techniques for Edge Computing
by: Gupta, Suhrid, et al.
Published: (2025)

Zenix: Efficient Execution of Bulky Serverless Applications
by: Guo, Zhiyuan, et al.
Published: (2022)

Towards Resource-Efficient Serverless LLM Inference with SLINFER
by: Xu, Chuhao, et al.
Published: (2025)

Prefill-Decode Aggregation or Disaggregation? Unifying Both for Goodput-Optimized LLM Serving
by: Wang, Chao, et al.
Published: (2025)

Cosmos: A Cost Model for Serverless Workflows in the 3D Compute Continuum
by: Marcelino, Cynthia, et al.
Published: (2025)

Caching Aided Multi-Tenant Serverless Computing
by: Qiao, Chu, et al.
Published: (2024)

DualMap: Enabling Both Cache Affinity and Load Balancing for Distributed LLM Serving
by: Yuan, Ying, et al.
Published: (2026)

Injecting Adrenaline into LLM Serving: Boosting Resource Utilization and Throughput via Attention Disaggregation
by: Liang, Yunkai, et al.
Published: (2025)

Past-Future Scheduler for LLM Serving under SLA Guarantees
by: Gong, Ruihao, et al.
Published: (2025)

Slice-Level Scheduling for High Throughput and Load Balanced LLM Serving
by: Cheng, Ke, et al.
Published: (2024)

Multi-Objective Optimization of Consumer Group Autoscaling in Message Broker Systems
by: Landau, Diogo, et al.
Published: (2024)

ORACL: Optimized Reasoning for Autoscaling via Chain of Thought with LLMs for Microservices
by: Bai, Haoyu, et al.
Published: (2026)

MoEless: Efficient MoE LLM Serving via Serverless Computing
by: Yu, Hanfei, et al.
Published: (2026)

ServerlessLoRA: Minimizing Latency and Cost in Serverless Inference for LoRA-Based LLMs
by: Sui, Yifan, et al.
Published: (2025)

BOute: Cost-Efficient LLM Serving with Heterogeneous LLMs and GPUs via Multi-Objective Bayesian Optimization
by: Jiang, Youhe, et al.
Published: (2026)

PipeBoost: Resilient Pipelined Architecture for Fast Serverless LLM Scaling
by: Liu, Chongpeng, et al.
Published: (2025)

In Serverless, OS Scheduler Choice Costs Money: A Hybrid Scheduling Approach for Cheaper FaaS
by: Zhao, Yuxuan, et al.
Published: (2024)

SAIR: Cost-Efficient Multi-Stage ML Pipeline Autoscaling via In-Context Reinforcement Learning
by: Su, Jianchang, et al.
Published: (2026)