:: Library Catalog

Bibliographic Details
Main Authors:	Zhan, Huiyou; Zhang, Xuan; Tan, Haisheng; Tian, Han; Yong, Dongping; Zhang, Junyang; Li, Xiang-Yang
Format:	Preprint
Published:	2025
Subjects:	Distributed, Parallel, and Cluster Computing
Online Access:	https://arxiv.org/abs/2501.09367

Similar Items

Cloud Native System for LLM Inference Serving
by: Xu, Minxian, et al.
Published: (2025)

EdgeServing: Deadline-Aware Multi-DNN Serving at the Edge
by: Cao, Jiahe, et al.
Published: (2026)

Embedding Samples Dispatching for Recommendation Model Training in Edge Environments
by: Li, Guopeng, et al.
Published: (2025)

CALVO: Improve Serving Efficiency for LLM Inferences with Intense Network Demands
by: Wang, Weiye, et al.
Published: (2026)

A Pipelined Collaborative Speculative Decoding Framework for Efficient Edge-Cloud LLM Inference
by: Zhang, Yida, et al.
Published: (2026)

ThunderServe: High-performance and Cost-efficient LLM Serving in Cloud Environments
by: Jiang, Youhe, et al.
Published: (2025)

OCTOPINF: Workload-Aware Inference Serving for Edge Video Analytics
by: Nguyen, Thanh-Tung, et al.
Published: (2025)

HydraServe: Minimizing Cold Start Latency for Serverless LLM Serving in Public Clouds
by: Lou, Chiheng, et al.
Published: (2025)

Efficient Routing of Inference Requests across LLM Instances in Cloud-Edge Computing
by: Yu, Shibo, et al.
Published: (2025)

PipeSD: An Efficient Cloud-Edge Collaborative Pipeline Inference Framework with Speculative Decoding
by: Han, Yunhe, et al.
Published: (2026)

EdgeShard: Efficient LLM Inference via Collaborative Edge Computing
by: Zhang, Mingjin, et al.
Published: (2024)

SageServe: Optimizing LLM Serving on Cloud Data Centers with Forecast Aware Auto-Scaling
by: Jaiswal, Shashwat, et al.
Published: (2025)

SLICE: SLO-Driven Scheduling for LLM Inference on Edge Computing Devices
by: Chow, Will
Published: (2025)

HybridFlow: Resource-Adaptive Subtask Routing for Efficient Edge-Cloud LLM Inference
by: Dong, Jiangwen, et al.
Published: (2025)

Collaborative Speculative Inference for Efficient LLM Inference Serving
by: Gao, Luyao, et al.
Published: (2025)

A Predictive and Synergistic Two-Layer Scheduling Framework for LLM Serving
by: Zhang, Yue, et al.
Published: (2025)

MoA-Off: Adaptive Heterogeneous Modality-Aware Offloading with Edge-Cloud Collaboration for Efficient Multimodal LLM Inference
by: Yang, Zheming, et al.
Published: (2025)

SynergAI: Edge-to-Cloud Synergy for Architecture-Driven High-Performance Orchestration for AI Inference
by: Stathopoulou, Foteini, et al.
Published: (2025)

SneakPeek: Data-Aware Model Selection and Scheduling for Inference Serving on the Edge
by: Wolfrath, Joel, et al.
Published: (2025)

CaraServe: CPU-Assisted and Rank-Aware LoRA Serving for Generative LLM Inference
by: Li, Suyi, et al.
Published: (2024)

GoodServe: Towards High-Goodput Serving of Agentic LLM Inferences over Heterogeneous Resources
by: Du, Boxiao, et al.
Published: (2026)

Decentralized LLM Inference over Edge Networks with Energy Harvesting
by: Khoshsirat, Aria, et al.
Published: (2024)

LLM-Driven Intent-Based Privacy-Aware Orchestration Across the Cloud-Edge Continuum
by: Su, Zijie, et al.
Published: (2026)

Enabling Efficient Serverless Inference Serving for LLM (Large Language Model) in the Cloud
by: Ghosh, Himel
Published: (2024)

MuxServe: Flexible Spatial-Temporal Multiplexing for Multiple LLM Serving
by: Duan, Jiangfei, et al.
Published: (2024)

DynaServe: Unified and Elastic Execution for Dynamic Disaggregated LLM Serving
by: Ruan, Chaoyi, et al.
Published: (2025)

UELLM: A Unified and Efficient Approach for LLM Inference Serving
by: He, Yiyuan, et al.
Published: (2024)

Efficient Multi-round LLM Inference over Disaggregated Serving
by: He, Wenhao, et al.
Published: (2026)

FREESH: Fair, Resource- and Energy-Efficient Scheduling for LLM Serving on Heterogeneous GPUs
by: He, Xuan, et al.
Published: (2025)

MSAO: Adaptive Modality Sparsity-Aware Offloading with Edge-Cloud Collaboration for Efficient Multimodal LLM Inference
by: Yang, Zheming, et al.
Published: (2026)

DuoServe-MoE: Dual-Phase Expert Prefetch and Caching for LLM Inference QoS Assurance
by: Zhang, Yuning, et al.
Published: (2025)

RAPID: Redundancy-Aware and Compatibility-Optimal Edge-Cloud Partitioned Inference for Diverse VLA Models
by: Zheng, Zihao, et al.
Published: (2026)

Collaborative Inference and Learning between Edge SLMs and Cloud LLMs: A Survey of Algorithms, Execution, and Open Challenges
by: Li, Senyao, et al.
Published: (2025)

DSD: A Distributed Speculative Decoding Solution for Edge-Cloud Agile Large Model Serving
by: Yu, Fengze, et al.
Published: (2025)

Offline Energy-Optimal LLM Serving: Workload-Based Energy Models for LLM Inference on Heterogeneous Systems
by: Wilkins, Grant, et al.
Published: (2024)

LIME: Accelerating Collaborative Lossless LLM Inference on Memory-Constrained Edge Devices
by: Sun, Mingyu, et al.
Published: (2025)

EcoServe: Enabling Cost-effective LLM Serving with Proactive Intra- and Inter-Instance Orchestration
by: Du, Jiangsu, et al.
Published: (2025)

DOPD: A Dynamic PD-Disaggregation Architecture for Maximizing Goodput in LLM Inference Serving
by: Liao, Junhan, et al.
Published: (2025)

Modular Foundation Model Inference at the Edge: Network-Aware Microservice Optimization
by: Zhu, Juan, et al.
Published: (2026)

Understanding the Performance and Power of LLM Inferencing on Edge Accelerators
by: Arya, Mayank, et al.
Published: (2025)