:: Library Catalog

Cover Image

Saved in:

Bibliographic Details
Main Authors:	Xie, Jincheng, Ling, Yawen, Xiao, Qi, Zhang, Feiyu, Huang, Zhongyi, Hu, Wen, Zheng, Yu
Format:	Preprint
Published:	2026
Subjects:	Distributed, Parallel, and Cluster Computing Artificial Intelligence
Online Access:	https://arxiv.org/abs/2605.08151
Tags:	Add Tag No Tags, Be the first to tag this record!

Similar Items

Collaborative Speculative Inference for Efficient LLM Inference Serving
by: Gao, Luyao, et al.
Published: (2025)

Ghidorah: Fast LLM Inference on Edge with Speculative Decoding and Hetero-Core Parallelism
by: Wei, Jinhui, et al.
Published: (2025)

UELLM: A Unified and Efficient Approach for LLM Inference Serving
by: He, Yiyuan, et al.
Published: (2024)

Optimizing Long-context LLM Serving via Fine-grained Sequence Parallelism
by: Li, Cong, et al.
Published: (2025)

HybridFlow: Resource-Adaptive Subtask Routing for Efficient Edge-Cloud LLM Inference
by: Dong, Jiangwen, et al.
Published: (2025)

GoodServe: Towards High-Goodput Serving of Agentic LLM Inferences over Heterogeneous Resources
by: Du, Boxiao, et al.
Published: (2026)

HAP: Hybrid Adaptive Parallelism for Efficient Mixture-of-Experts Inference
by: Lin, Haoran, et al.
Published: (2025)

Efficient Multi-round LLM Inference over Disaggregated Serving
by: He, Wenhao, et al.
Published: (2026)

BucketServe: Bucket-Based Dynamic Batching for Smart and Efficient LLM Inference Serving
by: Zheng, Wanyi, et al.
Published: (2025)

FASER: Fine-Grained Phase Management for Speculative Decoding in Dynamic LLM Serving
by: Chen, Wenyan, et al.
Published: (2026)

MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool
by: Hu, Cunchen, et al.
Published: (2024)

gLLM: Global Balanced Pipeline Parallelism System for Distributed LLM Serving with Token Throttling
by: Guo, Tianyu, et al.
Published: (2025)

PipeSD: An Efficient Cloud-Edge Collaborative Pipeline Inference Framework with Speculative Decoding
by: Han, Yunhe, et al.
Published: (2026)

CacheFlow: Efficient LLM Serving with 3D-Parallel KV Cache Restoration
by: Nian, Sean, et al.
Published: (2026)

SparseServe: Unlocking Parallelism for Dynamic Sparse Attention in Long-Context LLM Serving
by: Zhou, Qihui, et al.
Published: (2025)

SiPipe: Bridging the CPU-GPU Utilization Gap for Efficient Pipeline-Parallel LLM Inference
by: He, Yongchao, et al.
Published: (2025)

A Pipelined Collaborative Speculative Decoding Framework for Efficient Edge-Cloud LLM Inference
by: Zhang, Yida, et al.
Published: (2026)

S-HPLB: Efficient LLM Attention Serving via Sparsity-Aware Head Parallelism Load Balance
by: Liu, Di, et al.
Published: (2026)

StreamServe: Adaptive Speculative Flows for Low-Latency Disaggregated LLM Serving
by: Kumar, Satyam, et al.
Published: (2026)

FREESH: Fair, Resource- and Energy-Efficient Scheduling for LLM Serving on Heterogeneous GPUs
by: He, Xuan, et al.
Published: (2025)

Nitsum: Serving Tiered LLM Requests with Adaptive Tensor Parallelism
by: Srivatsa, Vikranth, et al.
Published: (2026)

Cloud Native System for LLM Inference Serving
by: Xu, Minxian, et al.
Published: (2025)

Quasar: Quantized Self-Speculative Acceleration for Rapid Inference via Memory-Efficient Verification
by: Huang, Guang, et al.
Published: (2026)

EconoServe: Maximizing Multi-Resource Utilization with SLO Guarantees in LLM Serving
by: Shen, Haiying, et al.
Published: (2024)

SpecInF: Exploiting Idle GPU Resources in Distributed DL Training via Speculative Inference Filling
by: Lv, Cunchi, et al.
Published: (2025)

CALVO: Improve Serving Efficiency for LLM Inferences with Intense Network Demands
by: Wang, Weiye, et al.
Published: (2026)

Communication-Efficient Serving for Video Diffusion Models with Latent Parallelism
by: Wu, Zhiyuan, et al.
Published: (2025)

FlowSpec: Continuous Pipelined Speculative Decoding for Efficient Distributed LLM Inference
by: Liu, Xing, et al.
Published: (2025)

SiDP: Memory-Efficient Data Parallelism for Offline LLM Inference
by: Zhao, Alan, et al.
Published: (2026)

Dilu: Enabling GPU Resourcing-on-Demand for Serverless DL Serving via Introspective Elasticity
by: Lv, Cunchi, et al.
Published: (2025)

MixServe: An Automatic Distributed Serving System for MoE Models with Hybrid Parallelism Based on Fused Communication Algorithm
by: Zhou, Bowen, et al.
Published: (2026)

Towards Resource-Efficient Serverless LLM Inference with SLINFER
by: Xu, Chuhao, et al.
Published: (2025)

PPipe: Efficient Video Analytics Serving on Heterogeneous GPU Clusters via Pool-Based Pipeline Parallelism
by: Kong, Z. Jonny, et al.
Published: (2025)

eLLM: Elastic Memory Management Framework for Efficient LLM Serving
by: Xu, Jiale, et al.
Published: (2025)

CaraServe: CPU-Assisted and Rank-Aware LoRA Serving for Generative LLM Inference
by: Li, Suyi, et al.
Published: (2024)

DOPD: A Dynamic PD-Disaggregation Architecture for Maximizing Goodput in LLM Inference Serving
by: Liao, Junhan, et al.
Published: (2025)

APEX: An Extensible and Dynamism-Aware Simulator for Automated Parallel Execution in LLM Serving
by: Lin, Yi-Chien, et al.
Published: (2024)

Efficient LLM Inference with Activation Checkpointing and Hybrid Caching
by: Lee, Sanghyeon, et al.
Published: (2025)

Hybrid-Parallel: Achieving High Performance and Energy Efficient Distributed Inference on Robots
by: Sun, Zekai, et al.
Published: (2024)

SpecFed: Accelerating Federated LLM Inference with Speculative Decoding and Compressed Transmission
by: Zheng, Ce, et al.
Published: (2026)