Bibliographic Details
Main Authors:	Yu, Shan; Xing, Jiarong; Qiao, Yifan; Ma, Mingyuan; Li, Yangmin; Wang, Yang; Yang, Shuo; Xie, Zhiqiang; Cao, Shiyi; Bao, Ke; Stoica, Ion; Xu, Harry; Sheng, Ying
Format:	Preprint
Published:	2025
Subjects:	Distributed, Parallel, and Cluster Computing; Artificial Intelligence; Machine Learning; Performance
Online Access:	https://arxiv.org/abs/2505.04021

Similar Items

ConServe: Fine-Grained GPU Harvesting for LLM Online and Offline Co-Serving
by: Qiao, Yifan, et al.
Published: (2024)

Unleashing the Power of Preemptive Priority-based Scheduling for Real-Time GPU Tasks
by: Wang, Yidi, et al.
Published: (2024)

The Energy Cost of Execution-Idle in GPU Clusters
by: Lei, Yiran, et al.
Published: (2026)

Towards Efficient and Practical GPU Multitasking in the Era of LLM
by: Xing, Jiarong, et al.
Published: (2025)

FastDecode: High-Throughput GPU-Efficient LLM Serving using Heterogeneous Pipelines
by: He, Jiaao, et al.
Published: (2024)

Efficient GPU-Centered Singular Value Decomposition Using the Divide-and-Conquer Method
by: Liu, Shifang, et al.
Published: (2025)

Serving Chain-structured Jobs with Large Memory Footprints with Application to Large Foundation Model Serving
by: Sun, Tingyang, et al.
Published: (2026)

NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference
by: Jiang, Xuanlin, et al.
Published: (2024)

BurstGPT: A Real-world Workload Dataset to Optimize LLM Serving Systems
by: Wang, Yuxin, et al.
Published: (2024)

Fairness in Serving Large Language Models
by: Sheng, Ying, et al.
Published: (2023)

Mélange: Cost Efficient Large Language Model Serving by Exploiting GPU Heterogeneity
by: Griggs, Tyler, et al.
Published: (2024)

UPMEM Unleashed: Software Secrets for Speed
by: Chmielewski, Krystian, et al.
Published: (2025)

Scalable GPU Performance Variability Analysis framework
by: Lahiry, Ankur, et al.
Published: (2025)

On the Partitioning of GPU Power among Multi-Instances
by: Vamja, Tirth, et al.
Published: (2025)

BestServe: Serving Strategies with Optimal Goodput in Collocation and Disaggregation Architectures
by: Hu, Xiannan, et al.
Published: (2025)

Disaggregated Design for GPU-Based Volumetric Data Structures
by: Meneghin, Massimiliano, et al.
Published: (2025)

Taking GPU Programming Models to Task for Performance Portability
by: Davis, Joshua H., et al.
Published: (2024)

Libra: Unleashing GPU Heterogeneity for High-Performance Sparse Matrix Multiplication
by: Shi, Jinliang, et al.
Published: (2025)

Profiling and optimization of multi-card GPU machine learning jobs
by: Lawenda, Marcin, et al.
Published: (2025)

CUTHERMO: Understanding GPU Memory Inefficiencies with Heat Map Profiling
by: Zhao, Yanbo, et al.
Published: (2025)

KEET: Explaining Performance of GPU Kernels Using LLM Agents
by: Davis, Joshua H., et al.
Published: (2026)

GhostServe: A Lightweight Checkpointing System in the Shadow for Fault-Tolerant LLM Serving
by: Jayakody, Shakya, et al.
Published: (2026)

CARAT: Client-Side Adaptive RPC and Cache Co-Tuning for Parallel File Systems
by: Rashid, Md Hasanur, et al.
Published: (2026)

High-Performance Portable GPU Primitives for Arbitrary Types and Operators in Julia
by: Pilliat, Emmanuel
Published: (2026)

Efficient allocation of image recognition and LLM tasks on multi-GPU system
by: Lawenda, Marcin, et al.
Published: (2025)

Data-Driven Analysis to Understand GPU Hardware Resource Usage of Optimizations
by: Islam, Tanzima Z., et al.
Published: (2024)

Blink: CPU-Free LLM Inference by Delegating the Serving Stack to GPU and SmartNIC
by: Siavashi, Mohammad, et al.
Published: (2026)

A Precision Emulation Approach to the GPU Acceleration of Ab Initio Electronic Structure Calculations
by: Liu, Hang, et al.
Published: (2026)

DIAL: Decentralized I/O AutoTuning via Learned Client-side Local Metrics for Parallel File System
by: Rashid, Md Hasanur, et al.
Published: (2026)

Minos: Systematically Classifying Performance and Power Characteristics of GPU Workloads on HPC Clusters
by: Jain, Rutwik, et al.
Published: (2026)

Dissecting CPU-GPU Unified Physical Memory on AMD MI300A APUs
by: Wahlgren, Jacob, et al.
Published: (2025)

LEO: Tracing GPU Stall Root Causes via Cross-Vendor Backward Slicing
by: Xia, Yuning, et al.
Published: (2026)

Fast and Scalable Mixed Precision Euclidean Distance Calculations Using GPU Tensor Cores
by: Curless, Brian, et al.
Published: (2025)

HybridGen: Efficient LLM Generative Inference via CPU-GPU Hybrid Computing
by: Lin, Mao, et al.
Published: (2026)

The Landscape of GPU-Centric Communication
by: Unat, Didem, et al.
Published: (2024)

GigaAPI for GPU Parallelization
by: Suvarna, M., et al.
Published: (2025)

Parallelizing a modern GPU simulator
by: Huerta, Rodrigo, et al.
Published: (2025)

Towards Portability at Scale: A Cross-Architecture Performance Evaluation of a GPU-enabled Shallow Water Solver
by: Villalobos, Johansell, et al.
Published: (2025)

EPIC: Efficient Position-Independent Caching for Serving Large Language Models
by: Hu, Junhao, et al.
Published: (2024)

Characterizing WebGPU Dispatch Overhead for LLM Inference Across Four GPU Vendors, Three Backends, and Three Browsers
by: Maczan, Jędrzej
Published: (2026)