| Main Authors: | Yu, Shan, Xing, Jiarong, Qiao, Yifan, Ma, Mingyuan, Li, Yangmin, Wang, Yang, Yang, Shuo, Xie, Zhiqiang, Cao, Shiyi, Bao, Ke, Stoica, Ion, Xu, Harry, Sheng, Ying |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2505.04021 |
Similar Items
ConServe: Fine-Grained GPU Harvesting for LLM Online and Offline Co-Serving
by: Qiao, Yifan, et al.
Published: (2024)
Unleashing the Power of Preemptive Priority-based Scheduling for Real-Time GPU Tasks
by: Wang, Yidi, et al.
Published: (2024)
The Energy Cost of Execution-Idle in GPU Clusters
by: Lei, Yiran, et al.
Published: (2026)
Towards Efficient and Practical GPU Multitasking in the Era of LLM
by: Xing, Jiarong, et al.
Published: (2025)
FastDecode: High-Throughput GPU-Efficient LLM Serving using Heterogeneous Pipelines
by: He, Jiaao, et al.
Published: (2024)
Efficient GPU-Centered Singular Value Decomposition Using the Divide-and-Conquer Method
by: Liu, Shifang, et al.
Published: (2025)
Serving Chain-structured Jobs with Large Memory Footprints with Application to Large Foundation Model Serving
by: Sun, Tingyang, et al.
Published: (2026)
NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference
by: Jiang, Xuanlin, et al.
Published: (2024)
BurstGPT: A Real-world Workload Dataset to Optimize LLM Serving Systems
by: Wang, Yuxin, et al.
Published: (2024)
Fairness in Serving Large Language Models
by: Sheng, Ying, et al.
Published: (2023)
Mélange: Cost Efficient Large Language Model Serving by Exploiting GPU Heterogeneity
by: Griggs, Tyler, et al.
Published: (2024)
UPMEM Unleashed: Software Secrets for Speed
by: Chmielewski, Krystian, et al.
Published: (2025)
Scalable GPU Performance Variability Analysis framework
by: Lahiry, Ankur, et al.
Published: (2025)
On the Partitioning of GPU Power among Multi-Instances
by: Vamja, Tirth, et al.
Published: (2025)
BestServe: Serving Strategies with Optimal Goodput in Collocation and Disaggregation Architectures
by: Hu, Xiannan, et al.
Published: (2025)
Disaggregated Design for GPU-Based Volumetric Data Structures
by: Meneghin, Massimiliano, et al.
Published: (2025)
Taking GPU Programming Models to Task for Performance Portability
by: Davis, Joshua H., et al.
Published: (2024)
Libra: Unleashing GPU Heterogeneity for High-Performance Sparse Matrix Multiplication
by: Shi, Jinliang, et al.
Published: (2025)
Profiling and optimization of multi-card GPU machine learning jobs
by: Lawenda, Marcin, et al.
Published: (2025)
CUTHERMO: Understanding GPU Memory Inefficiencies with Heat Map Profiling
by: Zhao, Yanbo, et al.
Published: (2025)
KEET: Explaining Performance of GPU Kernels Using LLM Agents
by: Davis, Joshua H., et al.
Published: (2026)
GhostServe: A Lightweight Checkpointing System in the Shadow for Fault-Tolerant LLM Serving
by: Jayakody, Shakya, et al.
Published: (2026)
CARAT: Client-Side Adaptive RPC and Cache Co-Tuning for Parallel File Systems
by: Rashid, Md Hasanur, et al.
Published: (2026)
High-Performance Portable GPU Primitives for Arbitrary Types and Operators in Julia
by: Pilliat, Emmanuel
Published: (2026)
Efficient allocation of image recognition and LLM tasks on multi-GPU system
by: Lawenda, Marcin, et al.
Published: (2025)
Data-Driven Analysis to Understand GPU Hardware Resource Usage of Optimizations
by: Islam, Tanzima Z., et al.
Published: (2024)
Blink: CPU-Free LLM Inference by Delegating the Serving Stack to GPU and SmartNIC
by: Siavashi, Mohammad, et al.
Published: (2026)
A Precision Emulation Approach to the GPU Acceleration of Ab Initio Electronic Structure Calculations
by: Liu, Hang, et al.
Published: (2026)
DIAL: Decentralized I/O AutoTuning via Learned Client-side Local Metrics for Parallel File System
by: Rashid, Md Hasanur, et al.
Published: (2026)
Minos: Systematically Classifying Performance and Power Characteristics of GPU Workloads on HPC Clusters
by: Jain, Rutwik, et al.
Published: (2026)
Dissecting CPU-GPU Unified Physical Memory on AMD MI300A APUs
by: Wahlgren, Jacob, et al.
Published: (2025)
LEO: Tracing GPU Stall Root Causes via Cross-Vendor Backward Slicing
by: Xia, Yuning, et al.
Published: (2026)
Fast and Scalable Mixed Precision Euclidean Distance Calculations Using GPU Tensor Cores
by: Curless, Brian, et al.
Published: (2025)
HybridGen: Efficient LLM Generative Inference via CPU-GPU Hybrid Computing
by: Lin, Mao, et al.
Published: (2026)
The Landscape of GPU-Centric Communication
by: Unat, Didem, et al.
Published: (2024)
GigaAPI for GPU Parallelization
by: Suvarna, M., et al.
Published: (2025)
Parallelizing a modern GPU simulator
by: Huerta, Rodrigo, et al.
Published: (2025)
Towards Portability at Scale: A Cross-Architecture Performance Evaluation of a GPU-enabled Shallow Water Solver
by: Villalobos, Johansell, et al.
Published: (2025)
EPIC: Efficient Position-Independent Caching for Serving Large Language Models
by: Hu, Junhao, et al.
Published: (2024)
Characterizing WebGPU Dispatch Overhead for LLM Inference Across Four GPU Vendors, Three Backends, and Three Browsers
by: Maczan, Jędrzej
Published: (2026)