| Main Authors: | Huang, Tao; Chen, Pengfei; Gong, Kyoka; Hawk, Jocky; Bright, Zachary; Xie, Wenxin; Huang, Kecheng; Ji, Zhi |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2407.09486 |
Similar Items
TokenScale: Timely and Accurate Autoscaling for Disaggregated LLM Serving with Token Velocity
by: Lai, Ruiqi, et al.
Published: (2025)
DeepServe: Serverless Large Language Model Serving at Scale
by: Hu, Junhao, et al.
Published: (2025)
The High Cost of Keeping Warm: Characterizing Overhead in Serverless Autoscaling Policies
by: Kondrashov, Leonid, et al.
Published: (2025)
HydraServe: Minimizing Cold Start Latency for Serverless LLM Serving in Public Clouds
by: Lou, Chiheng, et al.
Published: (2025)
CASA: A Framework for SLO and Carbon-Aware Autoscaling and Scheduling in Serverless Cloud Computing
by: Qi, S., et al.
Published: (2024)
Harpagon: Minimizing DNN Serving Cost via Efficient Dispatching, Scheduling and Splitting
by: Zhao, Zhixin, et al.
Published: (2024)
Hierarchical Autoscaling for Large Language Model Serving with Chiron
by: Patke, Archit, et al.
Published: (2025)
Dilu: Enabling GPU Resourcing-on-Demand for Serverless DL Serving via Introspective Elasticity
by: Lv, Cunchi, et al.
Published: (2025)
MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool
by: Hu, Cunchen, et al.
Published: (2024)
SpotKube: Cost-Optimal Microservices Deployment with Cluster Autoscaling and Spot Pricing
by: Edirisinghe, Dasith, et al.
Published: (2024)
FlexPipe: Adapting Dynamic LLM Serving Through Inflight Pipeline Refactoring in Fragmented Serverless Clusters
by: Lin, Yanying, et al.
Published: (2025)
SparseServe: Unlocking Parallelism for Dynamic Sparse Attention in Long-Context LLM Serving
by: Zhou, Qihui, et al.
Published: (2025)
ThunderServe: High-performance and Cost-efficient LLM Serving in Cloud Environments
by: Jiang, Youhe, et al.
Published: (2025)
PROSERVE: Unified Multi-Priority Request Scheduling for LLM Serving
by: Huang, Weizhe, et al.
Published: (2025)
An SLO Driven and Cost-Aware Autoscaling Framework for Kubernetes
by: Punniyamoorthy, Vinoth, et al.
Published: (2025)
Taming the Chaos: Coordinated Autoscaling for Heterogeneous and Disaggregated LLM Inference
by: Li, Rongzhi, et al.
Published: (2025)
Enabling Efficient Serverless Inference Serving for LLM (Large Language Model) in the Cloud
by: Ghosh, Himel
Published: (2024)
ServerlessLLM: Low-Latency Serverless Inference for Large Language Models
by: Fu, Yao, et al.
Published: (2024)
EcoServe: Enabling Cost-effective LLM Serving with Proactive Intra- and Inter-Instance Orchestration
by: Du, Jiangsu, et al.
Published: (2025)
Demystifying Cost-Efficiency in LLM Serving over Heterogeneous GPUs
by: Jiang, Youhe, et al.
Published: (2025)
It Takes Two to Tango: Serverless Workflow Serving via Bilaterally Engaged Resource Adaptation
by: Wu, Jing, et al.
Published: (2025)
Self-adaptive, Requirements-driven Autoscaling of Microservices
by: Nunes, João Paulo Karol Santos, et al.
Published: (2024)
Proactive and Reactive Autoscaling Techniques for Edge Computing
by: Gupta, Suhrid, et al.
Published: (2025)
Zenix: Efficient Execution of Bulky Serverless Applications
by: Guo, Zhiyuan, et al.
Published: (2022)
Towards Resource-Efficient Serverless LLM Inference with SLINFER
by: Xu, Chuhao, et al.
Published: (2025)
Prefill-Decode Aggregation or Disaggregation? Unifying Both for Goodput-Optimized LLM Serving
by: Wang, Chao, et al.
Published: (2025)
Cosmos: A Cost Model for Serverless Workflows in the 3D Compute Continuum
by: Marcelino, Cynthia, et al.
Published: (2025)
Caching Aided Multi-Tenant Serverless Computing
by: Qiao, Chu, et al.
Published: (2024)
DualMap: Enabling Both Cache Affinity and Load Balancing for Distributed LLM Serving
by: Yuan, Ying, et al.
Published: (2026)
Injecting Adrenaline into LLM Serving: Boosting Resource Utilization and Throughput via Attention Disaggregation
by: Liang, Yunkai, et al.
Published: (2025)
Past-Future Scheduler for LLM Serving under SLA Guarantees
by: Gong, Ruihao, et al.
Published: (2025)
Slice-Level Scheduling for High Throughput and Load Balanced LLM Serving
by: Cheng, Ke, et al.
Published: (2024)
Multi-Objective Optimization of Consumer Group Autoscaling in Message Broker Systems
by: Landau, Diogo, et al.
Published: (2024)
ORACL: Optimized Reasoning for Autoscaling via Chain of Thought with LLMs for Microservices
by: Bai, Haoyu, et al.
Published: (2026)
MoEless: Efficient MoE LLM Serving via Serverless Computing
by: Yu, Hanfei, et al.
Published: (2026)
ServerlessLoRA: Minimizing Latency and Cost in Serverless Inference for LoRA-Based LLMs
by: Sui, Yifan, et al.
Published: (2025)
BOute: Cost-Efficient LLM Serving with Heterogeneous LLMs and GPUs via Multi-Objective Bayesian Optimization
by: Jiang, Youhe, et al.
Published: (2026)
PipeBoost: Resilient Pipelined Architecture for Fast Serverless LLM Scaling
by: Liu, Chongpeng, et al.
Published: (2025)
In Serverless, OS Scheduler Choice Costs Money: A Hybrid Scheduling Approach for Cheaper FaaS
by: Zhao, Yuxuan, et al.
Published: (2024)
SAIR: Cost-Efficient Multi-Stage ML Pipeline Autoscaling via In-Context Reinforcement Learning
by: Su, Jianchang, et al.
Published: (2026)