:: Library Catalog

Bibliographic Details
Main Authors:	Guo, Yipin; Joshi, Siddharth
Format:	Preprint
Published:	2026
Subjects:	Distributed, Parallel, and Cluster Computing; Artificial Intelligence; Machine Learning
Online Access:	https://arxiv.org/abs/2605.01708

Similar Items

ZipMoE: Efficient On-Device MoE Serving via Lossless Compression and Cache-Affinity Scheduling
by: Yang, Yuchen, et al.
Published: (2026)

ZipServ: Fast and Memory-Efficient LLM Inference with Hardware-Aware Lossless Compression
by: Fan, Ruibo, et al.
Published: (2026)

UCCL-Zip: Lossless Compression Supercharged GPU Communication
by: Ma, Shuang, et al.
Published: (2026)

How Far Can Disaggregation Go? A Design-Space Exploration of Attention-FFN Disaggregation for Efficient MoE LLM Serving
by: Wu, Hanjiang, et al.
Published: (2026)

ShadowServe: Interference-Free KV Cache Fetching for Distributed Prefix Caching
by: Xiang, Xingyu, et al.
Published: (2025)

KVServe: Service-Aware KV Cache Compression for Communication-Efficient Disaggregated LLM Serving
by: Liu, Zedong, et al.
Published: (2026)

PrefillShare: A Shared Prefill Module for KV Reuse in Multi-LLM Disaggregated Serving
by: Woo, Sunghyeon, et al.
Published: (2026)

Learned Best-Effort LLM Serving
by: Jha, Siddharth, et al.
Published: (2024)

Efficient Multi-Adapter LLM Serving via Cross-Model KV-Cache Reuse with Activated LoRA
by: Li, Allison, et al.
Published: (2025)

AIConfigurator: Lightning-Fast Configuration Optimization for Multi-Framework LLM Serving
by: Xu, Tianhao, et al.
Published: (2026)

ForkKV: Scaling Multi-LoRA Agent Serving via Copy-on-Write Disaggregated KV Cache
by: Wang, Shao, et al.
Published: (2026)

StreamServe: Adaptive Speculative Flows for Low-Latency Disaggregated LLM Serving
by: Kumar, Satyam, et al.
Published: (2026)

Efficient Serving of LLM Applications with Probabilistic Demand Modeling
by: Liu, Yifei, et al.
Published: (2025)

Niyama: Breaking the Silos of LLM Inference Serving
by: Goel, Kanishk, et al.
Published: (2025)

On Evaluating Performance of LLM Inference Serving Systems
by: Agrawal, Amey, et al.
Published: (2025)

MatKV: Trading Compute for Flash Storage in LLM Inference
by: Shin, Kun-Woo, et al.
Published: (2025)

FDC: Fast KV Dimensionality Compression for Efficient LLM Inference
by: Zhang, Zeyu, et al.
Published: (2024)

Compress then Serve: Serving Thousands of LoRA Adapters with Little Overhead
by: Brüel-Gabrielsson, Rickard, et al.
Published: (2024)

Efficiently Serving Large Multimodal Models Using EPD Disaggregation
by: Singh, Gursimran, et al.
Published: (2024)

PolyKV: A Shared Asymmetrically-Compressed KV Cache Pool for Multi-Agent LLM Inference
by: Patel, Ishan, et al.
Published: (2026)

Synera: Synergistic LLM Serving across Device and Cloud at Scale
by: Wang, Genglin, et al.
Published: (2025)

Autellix: An Efficient Serving Engine for LLM Agents as General Programs
by: Luo, Michael, et al.
Published: (2025)

CoLLM: Continuous Adaptation for SLO-Aware LLM Serving on Shared GPU Clusters
by: Huang, Shaoyuan, et al.
Published: (2026)

LLM-PQ: Serving LLM on Heterogeneous Clusters with Phase-Aware Partition and Adaptive Quantization
by: Zhao, Juntao, et al.
Published: (2024)

Nexus: Proactive Intra-GPU Disaggregation of Prefill and Decode in LLM Serving
by: Shi, Xiaoxiang, et al.
Published: (2025)

Communication-Efficient Split Learning via Adaptive Feature-Wise Compression
by: Oh, Yongjeong, et al.
Published: (2023)

Irminsul: MLA-Native Position-Independent Caching for Agentic LLM Serving
by: Ma, Bole, et al.
Published: (2026)

FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving
by: Ye, Zihao, et al.
Published: (2025)

Serving Heterogeneous LoRA Adapters in Distributed LLM Inference Systems
by: Jaiswal, Shashwat, et al.
Published: (2025)

MoSKA: Mixture of Shared KV Attention for Efficient Long-Sequence LLM Inference
by: Rhee, Myunghyun, et al.
Published: (2025)

VoltanaLLM: Feedback-Driven Frequency Control and State-Space Routing for Energy-Efficient LLM Serving
by: Yu, Jiahuan, et al.
Published: (2025)

LiquidGEMM: Hardware-Efficient W4A8 GEMM Kernel for High-Performance LLM Serving
by: Hu, Huanqi, et al.
Published: (2025)

MoEless: Efficient MoE LLM Serving via Serverless Computing
by: Yu, Hanfei, et al.
Published: (2026)

RollArt: Scaling Agentic RL Training via Disaggregated Infrastructure
by: Gao, Wei, et al.
Published: (2025)

DeltaZip: Efficient Serving of Multiple Full-Model-Tuned LLMs
by: Yao, Xiaozhe, et al.
Published: (2023)

EPD-Serve: A Flexible Multimodal EPD Disaggregation Inference Serving System On Ascend
by: Bai, Fan, et al.
Published: (2026)

ModServe: Modality- and Stage-Aware Resource Disaggregation for Scalable Multimodal Model Serving
by: Qiu, Haoran, et al.
Published: (2025)

EdgeLoRA: An Efficient Multi-Tenant LLM Serving System on Edge Devices
by: Shen, Zheyu, et al.
Published: (2025)

LLMServingSim 2.0: A Unified Simulator for Heterogeneous and Disaggregated LLM Serving Infrastructure
by: Cho, Jaehong, et al.
Published: (2026)

SparseRL-Sync: Lossless Weight Synchronization with ~100x Less Communication
by: Hu, Lucas, et al.
Published: (2026)