| Main Authors: | Guo, Yipin, Joshi, Siddharth |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2605.01708 |
Similar Items
ZipMoE: Efficient On-Device MoE Serving via Lossless Compression and Cache-Affinity Scheduling
by: Yang, Yuchen, et al.
Published: (2026)
ZipServ: Fast and Memory-Efficient LLM Inference with Hardware-Aware Lossless Compression
by: Fan, Ruibo, et al.
Published: (2026)
UCCL-Zip: Lossless Compression Supercharged GPU Communication
by: Ma, Shuang, et al.
Published: (2026)
How Far Can Disaggregation Go? A Design-Space Exploration of Attention-FFN Disaggregation for Efficient MoE LLM Serving
by: Wu, Hanjiang, et al.
Published: (2026)
ShadowServe: Interference-Free KV Cache Fetching for Distributed Prefix Caching
by: Xiang, Xingyu, et al.
Published: (2025)
KVServe: Service-Aware KV Cache Compression for Communication-Efficient Disaggregated LLM Serving
by: Liu, Zedong, et al.
Published: (2026)
PrefillShare: A Shared Prefill Module for KV Reuse in Multi-LLM Disaggregated Serving
by: Woo, Sunghyeon, et al.
Published: (2026)
Learned Best-Effort LLM Serving
by: Jha, Siddharth, et al.
Published: (2024)
Efficient Multi-Adapter LLM Serving via Cross-Model KV-Cache Reuse with Activated LoRA
by: Li, Allison, et al.
Published: (2025)
AIConfigurator: Lightning-Fast Configuration Optimization for Multi-Framework LLM Serving
by: Xu, Tianhao, et al.
Published: (2026)
ForkKV: Scaling Multi-LoRA Agent Serving via Copy-on-Write Disaggregated KV Cache
by: Wang, Shao, et al.
Published: (2026)
StreamServe: Adaptive Speculative Flows for Low-Latency Disaggregated LLM Serving
by: Kumar, Satyam, et al.
Published: (2026)
Efficient Serving of LLM Applications with Probabilistic Demand Modeling
by: Liu, Yifei, et al.
Published: (2025)
Niyama : Breaking the Silos of LLM Inference Serving
by: Goel, Kanishk, et al.
Published: (2025)
On Evaluating Performance of LLM Inference Serving Systems
by: Agrawal, Amey, et al.
Published: (2025)
MatKV: Trading Compute for Flash Storage in LLM Inference
by: Shin, Kun-Woo, et al.
Published: (2025)
FDC: Fast KV Dimensionality Compression for Efficient LLM Inference
by: Zhang, Zeyu, et al.
Published: (2024)
Compress then Serve: Serving Thousands of LoRA Adapters with Little Overhead
by: Brüel-Gabrielsson, Rickard, et al.
Published: (2024)
Efficiently Serving Large Multimodal Models Using EPD Disaggregation
by: Singh, Gursimran, et al.
Published: (2024)
PolyKV: A Shared Asymmetrically-Compressed KV Cache Pool for Multi-Agent LLM Inference
by: Patel, Ishan, et al.
Published: (2026)
Synera: Synergistic LLM Serving across Device and Cloud at Scale
by: Wang, Genglin, et al.
Published: (2025)
Autellix: An Efficient Serving Engine for LLM Agents as General Programs
by: Luo, Michael, et al.
Published: (2025)
CoLLM: Continuous Adaptation for SLO-Aware LLM Serving on Shared GPU Clusters
by: Huang, Shaoyuan, et al.
Published: (2026)
LLM-PQ: Serving LLM on Heterogeneous Clusters with Phase-Aware Partition and Adaptive Quantization
by: Zhao, Juntao, et al.
Published: (2024)
Nexus:Proactive Intra-GPU Disaggregation of Prefill and Decode in LLM Serving
by: Shi, Xiaoxiang, et al.
Published: (2025)
Communication-Efficient Split Learning via Adaptive Feature-Wise Compression
by: Oh, Yongjeong, et al.
Published: (2023)
Irminsul: MLA-Native Position-Independent Caching for Agentic LLM Serving
by: Ma, Bole, et al.
Published: (2026)
FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving
by: Ye, Zihao, et al.
Published: (2025)
Serving Heterogeneous LoRA Adapters in Distributed LLM Inference Systems
by: Jaiswal, Shashwat, et al.
Published: (2025)
MoSKA: Mixture of Shared KV Attention for Efficient Long-Sequence LLM Inference
by: Rhee, Myunghyun, et al.
Published: (2025)
VoltanaLLM: Feedback-Driven Frequency Control and State-Space Routing for Energy-Efficient LLM Serving
by: Yu, Jiahuan, et al.
Published: (2025)
LiquidGEMM: Hardware-Efficient W4A8 GEMM Kernel for High-Performance LLM Serving
by: Hu, Huanqi, et al.
Published: (2025)
MoEless: Efficient MoE LLM Serving via Serverless Computing
by: Yu, Hanfei, et al.
Published: (2026)
RollArt: Scaling Agentic RL Training via Disaggregated Infrastructure
by: Gao, Wei, et al.
Published: (2025)
DeltaZip: Efficient Serving of Multiple Full-Model-Tuned LLMs
by: Yao, Xiaozhe, et al.
Published: (2023)
EPD-Serve: A Flexible Multimodal EPD Disaggregation Inference Serving System On Ascend
by: Bai, Fan, et al.
Published: (2026)
ModServe: Modality- and Stage-Aware Resource Disaggregation for Scalable Multimodal Model Serving
by: Qiu, Haoran, et al.
Published: (2025)
EdgeLoRA: An Efficient Multi-Tenant LLM Serving System on Edge Devices
by: Shen, Zheyu, et al.
Published: (2025)
LLMServingSim 2.0: A Unified Simulator for Heterogeneous and Disaggregated LLM Serving Infrastructure
by: Cho, Jaehong, et al.
Published: (2026)
SparseRL-Sync: Lossless Weight Synchronization with ~100x Less Communication
by: Hu, Lucas, et al.
Published: (2026)