| Main Authors: | Bournias, Ilias, Cavigelli, Lukas, Zacharopoulos, Georgios |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2411.05555 |
Similar Items
SkyWalker: A Locality-Aware Cross-Region Load Balancer for LLM Inference
by: Xia, Tian, et al.
Published: (2025)
PRESERVE: Prefetching Model Weights and KV-Cache in Distributed LLM Serving
by: Yüzügüler, Ahmet Caner, et al.
Published: (2025)
Tackling the Data-Parallel Load Balancing Bottleneck in LLM Serving: Practical Online Routing at Scale
by: Bu, Tianci, et al.
Published: (2026)
Slice-Level Scheduling for High Throughput and Load Balanced LLM Serving
by: Cheng, Ke, et al.
Published: (2024)
Understanding the Performance and Power of LLM Inferencing on Edge Accelerators
by: Arya, Mayank, et al.
Published: (2025)
DualMap: Enabling Both Cache Affinity and Load Balancing for Distributed LLM Serving
by: Yuan, Ying, et al.
Published: (2026)
CascadeInfer: Length-Aware Scheduling of LLM Serving with Low Latency and Load Balancing
by: Yuan, Yitao, et al.
Published: (2025)
Dooly: Configuration-Agnostic, Redundancy-Aware Profiling for LLM Inference Simulation
by: Kim, Joon Ha, et al.
Published: (2026)
S-HPLB: Efficient LLM Attention Serving via Sparsity-Aware Head Parallelism Load Balance
by: Liu, Di, et al.
Published: (2026)
LIME: Accelerating Collaborative Lossless LLM Inference on Memory-Constrained Edge Devices
by: Sun, Mingyu, et al.
Published: (2025)
ReaLB: Real-Time Load Balancing for Multimodal MoE Inference
by: Wang, Yingping, et al.
Published: (2026)
LMDeploy Accelerates Mixed-Precision LLM Inference with TurboMind
by: Zhang, Li, et al.
Published: (2025)
Block: Balancing Load in LLM Serving with Context, Knowledge and Predictive Scheduling
by: Da, Wei, et al.
Published: (2025)
Accelerating LLM Inference with Precomputed Query Storage
by: Park, Jay H., et al.
Published: (2025)
SiDP: Memory-Efficient Data Parallelism for Offline LLM Inference
by: Zhao, Alan, et al.
Published: (2026)
gLLM: Global Balanced Pipeline Parallelism System for Distributed LLM Serving with Token Throttling
by: Guo, Tianyu, et al.
Published: (2025)
Ethereal: Divide and Conquer Network Load Balancing in Large-Scale Distributed Training
by: Addanki, Vamsi, et al.
Published: (2024)
Accelerating Compound LLM Training Workloads with Maestro
by: Yuan, Xiulong, et al.
Published: (2026)
Intelligent Router for LLM Workloads: Improving Performance Through Workload-Aware Load Balancing
by: Jain, Kunal, et al.
Published: (2024)
LLM-Emu: Native Runtime Emulation of LLM Inference via Profile-Driven Sampling
by: Da, Wei, et al.
Published: (2026)
VQ-LLM: High-performance Code Generation for Vector Quantization Augmented LLM Inference
by: Liu, Zihan, et al.
Published: (2025)
Cloud Native System for LLM Inference Serving
by: Xu, Minxian, et al.
Published: (2025)
Enabling Dynamic Sparsity in Quantized LLM Inference
by: Wang, Rongxiang, et al.
Published: (2025)
Federated Inference for Heterogeneous LLM Communication and Collaboration
by: Chen, Zihan, et al.
Published: (2026)
Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads
by: Hu, Cunchen, et al.
Published: (2024)
Fail-Closed Lowering of Resident KV Claims onto LLM Serving Runtimes
by: Stepanek, Lukas
Published: (2026)
LAAFD: LLM-based Agents for Accelerated FPGA Design
by: Moraru, Maxim, et al.
Published: (2026)
PipeInfer: Accelerating LLM Inference using Asynchronous Pipelined Speculation
by: Butler, Branden, et al.
Published: (2024)
SYMPHONY: Improving Memory Management for LLM Inference Workloads
by: Agarwal, Saurabh, et al.
Published: (2024)
From Attention to Disaggregation: Tracing the Evolution of LLM Inference
by: Kumar, Madabattula Rajesh, et al.
Published: (2025)
Towards Resource-Efficient Serverless LLM Inference with SLINFER
by: Xu, Chuhao, et al.
Published: (2025)
Argus: Token Aware Distributed LLM Inference Optimization
by: Wu, Panlong, et al.
Published: (2025)
Toward Sustainability-Aware LLM Inference on Edge Clusters
by: Rajashekar, Kolichala, et al.
Published: (2025)
Distributed On-Device LLM Inference With Over-the-Air Computation
by: Zhang, Kai, et al.
Published: (2025)
Efficient LLM Inference with Activation Checkpointing and Hybrid Caching
by: Lee, Sanghyeon, et al.
Published: (2025)
WANSpec: Leveraging Global Compute Capacity for LLM Inference
by: Martin, Noah, et al.
Published: (2026)
Offline Energy-Optimal LLM Serving: Workload-Based Energy Models for LLM Inference on Heterogeneous Systems
by: Wilkins, Grant, et al.
Published: (2024)
LLM-CoOpt: A Co-Design and Optimization Framework for Efficient LLM Inference on Heterogeneous Platforms
by: Kong, Jie, et al.
Published: (2026)
DreamDDP: Accelerating Data Parallel Distributed LLM Training with Layer-wise Scheduled Partial Synchronization
by: Tang, Zhenheng, et al.
Published: (2025)
Distributed Load Balancing with Workload-Dependent Service Rates
by: Zhang, Wenxin, et al.
Published: (2024)