:: Library Catalog

Bibliographic Details
Main Authors:	Wilkins, Grant; Keshav, Srinivasan; Mortier, Richard
Format:	Preprint
Published:	2024
Subjects:	Distributed, Parallel, and Cluster Computing
Online Access:	https://arxiv.org/abs/2407.04014

Similar Items

Hybrid Heterogeneous Clusters Can Lower the Energy Consumption of LLM Inference Workloads
by: Wilkins, Grant, et al.
Published: (2024)

From Servers to Sites: Compositional Power Trace Generation of LLM Inference for Infrastructure Planning
by: Wilkins, Grant, et al.
Published: (2026)

FREESH: Fair, Resource- and Energy-Efficient Scheduling for LLM Serving on Heterogeneous GPUs
by: He, Xuan, et al.
Published: (2025)

Cloud Native System for LLM Inference Serving
by: Xu, Minxian, et al.
Published: (2025)

GENSERVE: Efficient Co-Serving of Heterogeneous Diffusion Model Workloads
by: Ye, Fanjiang, et al.
Published: (2026)

GoodServe: Towards High-Goodput Serving of Agentic LLM Inferences over Heterogeneous Resources
by: Du, Boxiao, et al.
Published: (2026)

OServe: Accelerating LLM Serving via Spatial-Temporal Workload Orchestration
by: Jiang, Youhe, et al.
Published: (2026)

Large-Scale LLM Inference with Heterogeneous Workloads: Prefill-Decode Contention and Asymptotically Optimal Control
by: Lin, Ruihan, et al.
Published: (2026)

DeServe: Towards Affordable Offline LLM Inference via Decentralization
by: Wu, Linyu, et al.
Published: (2025)

SYMPHONY: Improving Memory Management for LLM Inference Workloads
by: Agarwal, Saurabh, et al.
Published: (2024)

BurstGPT: A Real-world Workload Dataset to Optimize LLM Serving Systems
by: Wang, Yuxin, et al.
Published: (2024)

Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads
by: Hu, Cunchen, et al.
Published: (2024)

OCTOPINF: Workload-Aware Inference Serving for Edge Video Analytics
by: Nguyen, Thanh-Tung, et al.
Published: (2025)

Jenga: Effective Memory Management for Serving LLM with Heterogeneity
by: Zhang, Chen, et al.
Published: (2025)

OOCO: Latency-disaggregated Architecture for Online-Offline Co-locate LLM Serving
by: Wu, Siyu, et al.
Published: (2025)

DuoServe-MoE: Dual-Phase Expert Prefetch and Caching for LLM Inference QoS Assurance
by: Zhang, Yuning, et al.
Published: (2025)

BrownoutServe: SLO-Aware Inference Serving under Bursty Workloads for MoE-based LLMs
by: Hu, Jianmin, et al.
Published: (2025)

Demystifying Cost-Efficiency in LLM Serving over Heterogeneous GPUs
by: Jiang, Youhe, et al.
Published: (2025)

ServeGen: Workload Characterization and Generation of Large Language Model Serving in Production
by: Xiang, Yuxing, et al.
Published: (2025)

FedSZ: Leveraging Error-Bounded Lossy Compression for Federated Learning Communications
by: Wilkins, Grant, et al.
Published: (2023)

UELLM: A Unified and Efficient Approach for LLM Inference Serving
by: He, Yiyuan, et al.
Published: (2024)

Efficient Multi-round LLM Inference over Disaggregated Serving
by: He, Wenhao, et al.
Published: (2026)

Shift Parallelism: Low-Latency, High-Throughput LLM Inference for Dynamic Workloads
by: Hidayetoglu, Mert, et al.
Published: (2025)

CaraServe: CPU-Assisted and Rank-Aware LoRA Serving for Generative LLM Inference
by: Li, Suyi, et al.
Published: (2024)

Decentralized LLM Inference over Edge Networks with Energy Harvesting
by: Khoshsirat, Aria, et al.
Published: (2024)

DualScale: Energy-Efficient Disaggregated LLM Serving via Phase-Aware Placement and DVFS
by: Basit, Omar, et al.
Published: (2026)

Serving Heterogeneous LoRA Adapters in Distributed LLM Inference Systems
by: Jaiswal, Shashwat, et al.
Published: (2025)

PICE: A Semantic-Driven Progressive Inference System for LLM Serving in Cloud-Edge Networks
by: Zhan, Huiyou, et al.
Published: (2025)

To Compress or Not To Compress: Energy Trade-Offs and Benefits of Lossy Compressed I/O
by: Wilkins, Grant, et al.
Published: (2024)

ConServe: Fine-Grained GPU Harvesting for LLM Online and Offline Co-Serving
by: Qiao, Yifan, et al.
Published: (2024)

CALVO: Improve Serving Efficiency for LLM Inferences with Intense Network Demands
by: Wang, Weiye, et al.
Published: (2026)

PipeMax: Enhancing Offline LLM Inference on Commodity GPU Servers
by: Zhang, Hongbin, et al.
Published: (2026)

SiDP: Memory-Efficient Data Parallelism for Offline LLM Inference
by: Zhao, Alan, et al.
Published: (2026)

Federated Inference for Heterogeneous LLM Communication and Collaboration
by: Chen, Zihan, et al.
Published: (2026)

Quantifying the Energy Consumption and Carbon Emissions of LLM Inference via Simulations
by: Özcan, Miray, et al.
Published: (2025)

BucketServe: Bucket-Based Dynamic Batching for Smart and Efficient LLM Inference Serving
by: Zheng, Wanyi, et al.
Published: (2025)

Collaborative Speculative Inference for Efficient LLM Inference Serving
by: Gao, Luyao, et al.
Published: (2025)

Accelerating Compound LLM Training Workloads with Maestro
by: Yuan, Xiulong, et al.
Published: (2026)

Watt Counts: Energy-Aware Benchmark for Sustainable LLM Inference on Heterogeneous GPU Architectures
by: Argerich, Mauricio Fadel, et al.
Published: (2026)

HexAGenT: Efficient Agentic LLM Serving via Workflow- and Heterogeneity-Aware Scheduling
by: Peng, You, et al.
Published: (2026)