| Main Authors: | Zhan, Huiyou, Zhang, Xuan, Tan, Haisheng, Tian, Han, Yong, Dongping, Zhang, Junyang, Li, Xiang-Yang |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2501.09367 |
Similar Items
Cloud Native System for LLM Inference Serving
by: Xu, Minxian, et al.
Published: (2025)
EdgeServing: Deadline-Aware Multi-DNN Serving at the Edge
by: Cao, Jiahe, et al.
Published: (2026)
Embedding Samples Dispatching for Recommendation Model Training in Edge Environments
by: Li, Guopeng, et al.
Published: (2025)
CALVO: Improve Serving Efficiency for LLM Inferences with Intense Network Demands
by: Wang, Weiye, et al.
Published: (2026)
A Pipelined Collaborative Speculative Decoding Framework for Efficient Edge-Cloud LLM Inference
by: Zhang, Yida, et al.
Published: (2026)
ThunderServe: High-performance and Cost-efficient LLM Serving in Cloud Environments
by: Jiang, Youhe, et al.
Published: (2025)
OCTOPINF: Workload-Aware Inference Serving for Edge Video Analytics
by: Nguyen, Thanh-Tung, et al.
Published: (2025)
HydraServe: Minimizing Cold Start Latency for Serverless LLM Serving in Public Clouds
by: Lou, Chiheng, et al.
Published: (2025)
Efficient Routing of Inference Requests across LLM Instances in Cloud-Edge Computing
by: Yu, Shibo, et al.
Published: (2025)
PipeSD: An Efficient Cloud-Edge Collaborative Pipeline Inference Framework with Speculative Decoding
by: Han, Yunhe, et al.
Published: (2026)
EdgeShard: Efficient LLM Inference via Collaborative Edge Computing
by: Zhang, Mingjin, et al.
Published: (2024)
SageServe: Optimizing LLM Serving on Cloud Data Centers with Forecast Aware Auto-Scaling
by: Jaiswal, Shashwat, et al.
Published: (2025)
SLICE: SLO-Driven Scheduling for LLM Inference on Edge Computing Devices
by: Chow, Will
Published: (2025)
HybridFlow: Resource-Adaptive Subtask Routing for Efficient Edge-Cloud LLM Inference
by: Dong, Jiangwen, et al.
Published: (2025)
Collaborative Speculative Inference for Efficient LLM Inference Serving
by: Gao, Luyao, et al.
Published: (2025)
A Predictive and Synergistic Two-Layer Scheduling Framework for LLM Serving
by: Zhang, Yue, et al.
Published: (2025)
MoA-Off: Adaptive Heterogeneous Modality-Aware Offloading with Edge-Cloud Collaboration for Efficient Multimodal LLM Inference
by: Yang, Zheming, et al.
Published: (2025)
SynergAI: Edge-to-Cloud Synergy for Architecture-Driven High-Performance Orchestration for AI Inference
by: Stathopoulou, Foteini, et al.
Published: (2025)
SneakPeek: Data-Aware Model Selection and Scheduling for Inference Serving on the Edge
by: Wolfrath, Joel, et al.
Published: (2025)
CaraServe: CPU-Assisted and Rank-Aware LoRA Serving for Generative LLM Inference
by: Li, Suyi, et al.
Published: (2024)
GoodServe: Towards High-Goodput Serving of Agentic LLM Inferences over Heterogeneous Resources
by: Du, Boxiao, et al.
Published: (2026)
Decentralized LLM Inference over Edge Networks with Energy Harvesting
by: Khoshsirat, Aria, et al.
Published: (2024)
LLM-Driven Intent-Based Privacy-Aware Orchestration Across the Cloud-Edge Continuum
by: Su, Zijie, et al.
Published: (2026)
Enabling Efficient Serverless Inference Serving for LLM (Large Language Model) in the Cloud
by: Ghosh, Himel
Published: (2024)
MuxServe: Flexible Spatial-Temporal Multiplexing for Multiple LLM Serving
by: Duan, Jiangfei, et al.
Published: (2024)
DynaServe: Unified and Elastic Execution for Dynamic Disaggregated LLM Serving
by: Ruan, Chaoyi, et al.
Published: (2025)
UELLM: A Unified and Efficient Approach for LLM Inference Serving
by: He, Yiyuan, et al.
Published: (2024)
Efficient Multi-round LLM Inference over Disaggregated Serving
by: He, Wenhao, et al.
Published: (2026)
FREESH: Fair, Resource- and Energy-Efficient Scheduling for LLM Serving on Heterogeneous GPUs
by: He, Xuan, et al.
Published: (2025)
MSAO: Adaptive Modality Sparsity-Aware Offloading with Edge-Cloud Collaboration for Efficient Multimodal LLM Inference
by: Yang, Zheming, et al.
Published: (2026)
DuoServe-MoE: Dual-Phase Expert Prefetch and Caching for LLM Inference QoS Assurance
by: Zhang, Yuning, et al.
Published: (2025)
RAPID: Redundancy-Aware and Compatibility-Optimal Edge-Cloud Partitioned Inference for Diverse VLA Models
by: Zheng, Zihao, et al.
Published: (2026)
Collaborative Inference and Learning between Edge SLMs and Cloud LLMs: A Survey of Algorithms, Execution, and Open Challenges
by: Li, Senyao, et al.
Published: (2025)
DSD: A Distributed Speculative Decoding Solution for Edge-Cloud Agile Large Model Serving
by: Yu, Fengze, et al.
Published: (2025)
Offline Energy-Optimal LLM Serving: Workload-Based Energy Models for LLM Inference on Heterogeneous Systems
by: Wilkins, Grant, et al.
Published: (2024)
LIME: Accelerating Collaborative Lossless LLM Inference on Memory-Constrained Edge Devices
by: Sun, Mingyu, et al.
Published: (2025)
EcoServe: Enabling Cost-effective LLM Serving with Proactive Intra- and Inter-Instance Orchestration
by: Du, Jiangsu, et al.
Published: (2025)
DOPD: A Dynamic PD-Disaggregation Architecture for Maximizing Goodput in LLM Inference Serving
by: Liao, Junhan, et al.
Published: (2025)
Modular Foundation Model Inference at the Edge: Network-Aware Microservice Optimization
by: Zhu, Juan, et al.
Published: (2026)
Understanding the Performance and Power of LLM Inferencing on Edge Accelerators
by: Arya, Mayank, et al.
Published: (2025)