:: Library Catalog

Bibliographic Details
Main Authors:	Bournias, Ilias; Cavigelli, Lukas; Zacharopoulos, Georgios
Format:	Preprint
Published:	2024
Subjects:	Distributed, Parallel, and Cluster Computing
Online Access:	https://arxiv.org/abs/2411.05555

Similar Items

SkyWalker: A Locality-Aware Cross-Region Load Balancer for LLM Inference
by: Xia, Tian, et al.
Published: (2025)

PRESERVE: Prefetching Model Weights and KV-Cache in Distributed LLM Serving
by: Yüzügüler, Ahmet Caner, et al.
Published: (2025)

Tackling the Data-Parallel Load Balancing Bottleneck in LLM Serving: Practical Online Routing at Scale
by: Bu, Tianci, et al.
Published: (2026)

Slice-Level Scheduling for High Throughput and Load Balanced LLM Serving
by: Cheng, Ke, et al.
Published: (2024)

Understanding the Performance and Power of LLM Inferencing on Edge Accelerators
by: Arya, Mayank, et al.
Published: (2025)

DualMap: Enabling Both Cache Affinity and Load Balancing for Distributed LLM Serving
by: Yuan, Ying, et al.
Published: (2026)

CascadeInfer: Length-Aware Scheduling of LLM Serving with Low Latency and Load Balancing
by: Yuan, Yitao, et al.
Published: (2025)

Dooly: Configuration-Agnostic, Redundancy-Aware Profiling for LLM Inference Simulation
by: Kim, Joon Ha, et al.
Published: (2026)

S-HPLB: Efficient LLM Attention Serving via Sparsity-Aware Head Parallelism Load Balance
by: Liu, Di, et al.
Published: (2026)

LIME: Accelerating Collaborative Lossless LLM Inference on Memory-Constrained Edge Devices
by: Sun, Mingyu, et al.
Published: (2025)

ReaLB: Real-Time Load Balancing for Multimodal MoE Inference
by: Wang, Yingping, et al.
Published: (2026)

LMDeploy Accelerates Mixed-Precision LLM Inference with TurboMind
by: Zhang, Li, et al.
Published: (2025)

Block: Balancing Load in LLM Serving with Context, Knowledge and Predictive Scheduling
by: Da, Wei, et al.
Published: (2025)

Accelerating LLM Inference with Precomputed Query Storage
by: Park, Jay H., et al.
Published: (2025)

SiDP: Memory-Efficient Data Parallelism for Offline LLM Inference
by: Zhao, Alan, et al.
Published: (2026)

gLLM: Global Balanced Pipeline Parallelism System for Distributed LLM Serving with Token Throttling
by: Guo, Tianyu, et al.
Published: (2025)

Ethereal: Divide and Conquer Network Load Balancing in Large-Scale Distributed Training
by: Addanki, Vamsi, et al.
Published: (2024)

Accelerating Compound LLM Training Workloads with Maestro
by: Yuan, Xiulong, et al.
Published: (2026)

Intelligent Router for LLM Workloads: Improving Performance Through Workload-Aware Load Balancing
by: Jain, Kunal, et al.
Published: (2024)

LLM-Emu: Native Runtime Emulation of LLM Inference via Profile-Driven Sampling
by: Da, Wei, et al.
Published: (2026)

VQ-LLM: High-performance Code Generation for Vector Quantization Augmented LLM Inference
by: Liu, Zihan, et al.
Published: (2025)

Cloud Native System for LLM Inference Serving
by: Xu, Minxian, et al.
Published: (2025)

Enabling Dynamic Sparsity in Quantized LLM Inference
by: Wang, Rongxiang, et al.
Published: (2025)

Federated Inference for Heterogeneous LLM Communication and Collaboration
by: Chen, Zihan, et al.
Published: (2026)

Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads
by: Hu, Cunchen, et al.
Published: (2024)

Fail-Closed Lowering of Resident KV Claims onto LLM Serving Runtimes
by: Stepanek, Lukas
Published: (2026)

LAAFD: LLM-based Agents for Accelerated FPGA Design
by: Moraru, Maxim, et al.
Published: (2026)

PipeInfer: Accelerating LLM Inference using Asynchronous Pipelined Speculation
by: Butler, Branden, et al.
Published: (2024)

SYMPHONY: Improving Memory Management for LLM Inference Workloads
by: Agarwal, Saurabh, et al.
Published: (2024)

From Attention to Disaggregation: Tracing the Evolution of LLM Inference
by: Kumar, Madabattula Rajesh, et al.
Published: (2025)

Towards Resource-Efficient Serverless LLM Inference with SLINFER
by: Xu, Chuhao, et al.
Published: (2025)

Argus: Token Aware Distributed LLM Inference Optimization
by: Wu, Panlong, et al.
Published: (2025)

Toward Sustainability-Aware LLM Inference on Edge Clusters
by: Rajashekar, Kolichala, et al.
Published: (2025)

Distributed On-Device LLM Inference With Over-the-Air Computation
by: Zhang, Kai, et al.
Published: (2025)

Efficient LLM Inference with Activation Checkpointing and Hybrid Caching
by: Lee, Sanghyeon, et al.
Published: (2025)

WANSpec: Leveraging Global Compute Capacity for LLM Inference
by: Martin, Noah, et al.
Published: (2026)

Offline Energy-Optimal LLM Serving: Workload-Based Energy Models for LLM Inference on Heterogeneous Systems
by: Wilkins, Grant, et al.
Published: (2024)

LLM-CoOpt: A Co-Design and Optimization Framework for Efficient LLM Inference on Heterogeneous Platforms
by: Kong, Jie, et al.
Published: (2026)

DreamDDP: Accelerating Data Parallel Distributed LLM Training with Layer-wise Scheduled Partial Synchronization
by: Tang, Zhenheng, et al.
Published: (2025)

Distributed Load Balancing with Workload-Dependent Service Rates
by: Zhang, Wenxin, et al.
Published: (2024)