| Main Authors: | Xie, Jincheng, Ling, Yawen, Xiao, Qi, Zhang, Feiyu, Huang, Zhongyi, Hu, Wen, Zheng, Yu |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Online Access: | https://arxiv.org/abs/2605.08151 |
Similar Items
Collaborative Speculative Inference for Efficient LLM Inference Serving
by: Gao, Luyao, et al.
Published: (2025)
Ghidorah: Fast LLM Inference on Edge with Speculative Decoding and Hetero-Core Parallelism
by: Wei, Jinhui, et al.
Published: (2025)
UELLM: A Unified and Efficient Approach for LLM Inference Serving
by: He, Yiyuan, et al.
Published: (2024)
Optimizing Long-context LLM Serving via Fine-grained Sequence Parallelism
by: Li, Cong, et al.
Published: (2025)
HybridFlow: Resource-Adaptive Subtask Routing for Efficient Edge-Cloud LLM Inference
by: Dong, Jiangwen, et al.
Published: (2025)
GoodServe: Towards High-Goodput Serving of Agentic LLM Inferences over Heterogeneous Resources
by: Du, Boxiao, et al.
Published: (2026)
HAP: Hybrid Adaptive Parallelism for Efficient Mixture-of-Experts Inference
by: Lin, Haoran, et al.
Published: (2025)
Efficient Multi-round LLM Inference over Disaggregated Serving
by: He, Wenhao, et al.
Published: (2026)
BucketServe: Bucket-Based Dynamic Batching for Smart and Efficient LLM Inference Serving
by: Zheng, Wanyi, et al.
Published: (2025)
FASER: Fine-Grained Phase Management for Speculative Decoding in Dynamic LLM Serving
by: Chen, Wenyan, et al.
Published: (2026)
MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool
by: Hu, Cunchen, et al.
Published: (2024)
gLLM: Global Balanced Pipeline Parallelism System for Distributed LLM Serving with Token Throttling
by: Guo, Tianyu, et al.
Published: (2025)
PipeSD: An Efficient Cloud-Edge Collaborative Pipeline Inference Framework with Speculative Decoding
by: Han, Yunhe, et al.
Published: (2026)
CacheFlow: Efficient LLM Serving with 3D-Parallel KV Cache Restoration
by: Nian, Sean, et al.
Published: (2026)
SparseServe: Unlocking Parallelism for Dynamic Sparse Attention in Long-Context LLM Serving
by: Zhou, Qihui, et al.
Published: (2025)
SiPipe: Bridging the CPU-GPU Utilization Gap for Efficient Pipeline-Parallel LLM Inference
by: He, Yongchao, et al.
Published: (2025)
A Pipelined Collaborative Speculative Decoding Framework for Efficient Edge-Cloud LLM Inference
by: Zhang, Yida, et al.
Published: (2026)
S-HPLB: Efficient LLM Attention Serving via Sparsity-Aware Head Parallelism Load Balance
by: Liu, Di, et al.
Published: (2026)
StreamServe: Adaptive Speculative Flows for Low-Latency Disaggregated LLM Serving
by: Kumar, Satyam, et al.
Published: (2026)
FREESH: Fair, Resource- and Energy-Efficient Scheduling for LLM Serving on Heterogeneous GPUs
by: He, Xuan, et al.
Published: (2025)
Nitsum: Serving Tiered LLM Requests with Adaptive Tensor Parallelism
by: Srivatsa, Vikranth, et al.
Published: (2026)
Cloud Native System for LLM Inference Serving
by: Xu, Minxian, et al.
Published: (2025)
Quasar: Quantized Self-Speculative Acceleration for Rapid Inference via Memory-Efficient Verification
by: Huang, Guang, et al.
Published: (2026)
EconoServe: Maximizing Multi-Resource Utilization with SLO Guarantees in LLM Serving
by: Shen, Haiying, et al.
Published: (2024)
SpecInF: Exploiting Idle GPU Resources in Distributed DL Training via Speculative Inference Filling
by: Lv, Cunchi, et al.
Published: (2025)
CALVO: Improve Serving Efficiency for LLM Inferences with Intense Network Demands
by: Wang, Weiye, et al.
Published: (2026)
Communication-Efficient Serving for Video Diffusion Models with Latent Parallelism
by: Wu, Zhiyuan, et al.
Published: (2025)
FlowSpec: Continuous Pipelined Speculative Decoding for Efficient Distributed LLM Inference
by: Liu, Xing, et al.
Published: (2025)
SiDP: Memory-Efficient Data Parallelism for Offline LLM Inference
by: Zhao, Alan, et al.
Published: (2026)
Dilu: Enabling GPU Resourcing-on-Demand for Serverless DL Serving via Introspective Elasticity
by: Lv, Cunchi, et al.
Published: (2025)
MixServe: An Automatic Distributed Serving System for MoE Models with Hybrid Parallelism Based on Fused Communication Algorithm
by: Zhou, Bowen, et al.
Published: (2026)
Towards Resource-Efficient Serverless LLM Inference with SLINFER
by: Xu, Chuhao, et al.
Published: (2025)
PPipe: Efficient Video Analytics Serving on Heterogeneous GPU Clusters via Pool-Based Pipeline Parallelism
by: Kong, Z. Jonny, et al.
Published: (2025)
eLLM: Elastic Memory Management Framework for Efficient LLM Serving
by: Xu, Jiale, et al.
Published: (2025)
CaraServe: CPU-Assisted and Rank-Aware LoRA Serving for Generative LLM Inference
by: Li, Suyi, et al.
Published: (2024)
DOPD: A Dynamic PD-Disaggregation Architecture for Maximizing Goodput in LLM Inference Serving
by: Liao, Junhan, et al.
Published: (2025)
APEX: An Extensible and Dynamism-Aware Simulator for Automated Parallel Execution in LLM Serving
by: Lin, Yi-Chien, et al.
Published: (2024)
Efficient LLM Inference with Activation Checkpointing and Hybrid Caching
by: Lee, Sanghyeon, et al.
Published: (2025)
Hybrid-Parallel: Achieving High Performance and Energy Efficient Distributed Inference on Robots
by: Sun, Zekai, et al.
Published: (2024)
SpecFed: Accelerating Federated LLM Inference with Speculative Decoding and Compressed Transmission
by: Zheng, Ce, et al.
Published: (2026)