| Main Authors: | Wilkins, Grant; Keshav, Srinivasan; Mortier, Richard |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2407.04014 |
Similar Items
Hybrid Heterogeneous Clusters Can Lower the Energy Consumption of LLM Inference Workloads
by: Wilkins, Grant, et al.
Published: (2024)
From Servers to Sites: Compositional Power Trace Generation of LLM Inference for Infrastructure Planning
by: Wilkins, Grant, et al.
Published: (2026)
FREESH: Fair, Resource- and Energy-Efficient Scheduling for LLM Serving on Heterogeneous GPUs
by: He, Xuan, et al.
Published: (2025)
Cloud Native System for LLM Inference Serving
by: Xu, Minxian, et al.
Published: (2025)
GENSERVE: Efficient Co-Serving of Heterogeneous Diffusion Model Workloads
by: Ye, Fanjiang, et al.
Published: (2026)
GoodServe: Towards High-Goodput Serving of Agentic LLM Inferences over Heterogeneous Resources
by: Du, Boxiao, et al.
Published: (2026)
OServe: Accelerating LLM Serving via Spatial-Temporal Workload Orchestration
by: Jiang, Youhe, et al.
Published: (2026)
Large-Scale LLM Inference with Heterogeneous Workloads: Prefill-Decode Contention and Asymptotically Optimal Control
by: Lin, Ruihan, et al.
Published: (2026)
DeServe: Towards Affordable Offline LLM Inference via Decentralization
by: Wu, Linyu, et al.
Published: (2025)
SYMPHONY: Improving Memory Management for LLM Inference Workloads
by: Agarwal, Saurabh, et al.
Published: (2024)
BurstGPT: A Real-world Workload Dataset to Optimize LLM Serving Systems
by: Wang, Yuxin, et al.
Published: (2024)
Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads
by: Hu, Cunchen, et al.
Published: (2024)
OCTOPINF: Workload-Aware Inference Serving for Edge Video Analytics
by: Nguyen, Thanh-Tung, et al.
Published: (2025)
Jenga: Effective Memory Management for Serving LLM with Heterogeneity
by: Zhang, Chen, et al.
Published: (2025)
OOCO: Latency-disaggregated Architecture for Online-Offline Co-locate LLM Serving
by: Wu, Siyu, et al.
Published: (2025)
DuoServe-MoE: Dual-Phase Expert Prefetch and Caching for LLM Inference QoS Assurance
by: Zhang, Yuning, et al.
Published: (2025)
BrownoutServe: SLO-Aware Inference Serving under Bursty Workloads for MoE-based LLMs
by: Hu, Jianmin, et al.
Published: (2025)
Demystifying Cost-Efficiency in LLM Serving over Heterogeneous GPUs
by: Jiang, Youhe, et al.
Published: (2025)
ServeGen: Workload Characterization and Generation of Large Language Model Serving in Production
by: Xiang, Yuxing, et al.
Published: (2025)
FedSZ: Leveraging Error-Bounded Lossy Compression for Federated Learning Communications
by: Wilkins, Grant, et al.
Published: (2023)
UELLM: A Unified and Efficient Approach for LLM Inference Serving
by: He, Yiyuan, et al.
Published: (2024)
Efficient Multi-round LLM Inference over Disaggregated Serving
by: He, Wenhao, et al.
Published: (2026)
Shift Parallelism: Low-Latency, High-Throughput LLM Inference for Dynamic Workloads
by: Hidayetoglu, Mert, et al.
Published: (2025)
CaraServe: CPU-Assisted and Rank-Aware LoRA Serving for Generative LLM Inference
by: Li, Suyi, et al.
Published: (2024)
Decentralized LLM Inference over Edge Networks with Energy Harvesting
by: Khoshsirat, Aria, et al.
Published: (2024)
DualScale: Energy-Efficient Disaggregated LLM Serving via Phase-Aware Placement and DVFS
by: Basit, Omar, et al.
Published: (2026)
Serving Heterogeneous LoRA Adapters in Distributed LLM Inference Systems
by: Jaiswal, Shashwat, et al.
Published: (2025)
PICE: A Semantic-Driven Progressive Inference System for LLM Serving in Cloud-Edge Networks
by: Zhan, Huiyou, et al.
Published: (2025)
To Compress or Not To Compress: Energy Trade-Offs and Benefits of Lossy Compressed I/O
by: Wilkins, Grant, et al.
Published: (2024)
ConServe: Fine-Grained GPU Harvesting for LLM Online and Offline Co-Serving
by: Qiao, Yifan, et al.
Published: (2024)
CALVO: Improve Serving Efficiency for LLM Inferences with Intense Network Demands
by: Wang, Weiye, et al.
Published: (2026)
PipeMax: Enhancing Offline LLM Inference on Commodity GPU Servers
by: Zhang, Hongbin, et al.
Published: (2026)
SiDP: Memory-Efficient Data Parallelism for Offline LLM Inference
by: Zhao, Alan, et al.
Published: (2026)
Federated Inference for Heterogeneous LLM Communication and Collaboration
by: Chen, Zihan, et al.
Published: (2026)
Quantifying the Energy Consumption and Carbon Emissions of LLM Inference via Simulations
by: Özcan, Miray, et al.
Published: (2025)
BucketServe: Bucket-Based Dynamic Batching for Smart and Efficient LLM Inference Serving
by: Zheng, Wanyi, et al.
Published: (2025)
Collaborative Speculative Inference for Efficient LLM Inference Serving
by: Gao, Luyao, et al.
Published: (2025)
Accelerating Compound LLM Training Workloads with Maestro
by: Yuan, Xiulong, et al.
Published: (2026)
Watt Counts: Energy-Aware Benchmark for Sustainable LLM Inference on Heterogeneous GPU Architectures
by: Argerich, Mauricio Fadel, et al.
Published: (2026)
HexAGenT: Efficient Agentic LLM Serving via Workflow- and Heterogeneity-Aware Scheduling
by: Peng, You, et al.
Published: (2026)