Library Catalog

Bibliographic Details
Main Authors:	Levine, Reese; Sharma, Rithik; Jain, Nikhil; Ramesh, Abhijit; Chen, Zheyuan; Abbas, Neha; Contini, James; Sorensen, Tyler
Format:	Preprint
Published:	2026
Subjects:	Distributed, Parallel, and Cluster Computing; Artificial Intelligence; Machine Learning
Online Access:	https://arxiv.org/abs/2605.20706

Similar Items

Characterizing WebGPU Dispatch Overhead for LLM Inference Across Four GPU Vendors, Three Backends, and Three Browsers
by: Maczan, Jędrzej
Published: (2026)

LeftoverLocals: Listening to LLM Responses Through Leaked GPU Local Memory
by: Sorensen, Tyler, et al.
Published: (2024)

Efficient and Portable Support for Overdecomposition on Distributed Memory GPGPU Platforms
by: Bhosale, Aditya, et al.
Published: (2026)

KEET: Explaining Performance of GPU Kernels Using LLM Agents
by: Davis, Joshua H., et al.
Published: (2026)

Beyond Microservices: Testing Web-Scale RCA Methods on GPU-Driven LLM Workloads
by: Scheinert, Dominik, et al.
Published: (2026)

Performant Unified GPU Kernels for Portable Singular Value Computation Across Hardware and Precision
by: Ringoot, Evelyne, et al.
Published: (2025)

Taking GPU Programming Models to Task for Performance Portability
by: Davis, Joshua H., et al.
Published: (2024)

Mewz: Lightweight Execution Environment for WebAssembly with High Isolation and Portability using Unikernels
by: Ueda, Soichiro, et al.
Published: (2024)

DAK: Direct-Access-Enabled GPU Memory Offloading with Optimal Efficiency for LLM Inference
by: Lin, Shouxu, et al.
Published: (2026)

High-Performance Portable GPU Primitives for Arbitrary Types and Operators in Julia
by: Pilliat, Emmanuel
Published: (2026)

Efficient CPU-GPU Collaborative Inference for MoE-based LLMs on Memory-Limited Systems
by: Huang, En-Ming, et al.
Published: (2025)

HarMoEny: Efficient Multi-GPU Inference of MoE Models
by: Doucet, Zachary, et al.
Published: (2025)

Portability Efficiency Approach for Calculating Performance Portability
by: Marowka, Ami
Published: (2024)

Understanding the Landscape of Ampere GPU Memory Errors
by: Zhu, Zhu, et al.
Published: (2025)

Breaking the Memory Wall: A Study of I/O Patterns and GPU Memory Utilization for Hybrid CPU-GPU Offloaded Optimizers
by: Maurya, Avinash, et al.
Published: (2024)

Web3DB: Web 3.0 RDBMS for Individual Data Ownership
by: Mukherjee, Shankha Shubhra, et al.
Published: (2025)

Toward Portable GPU Performance: Julia Recursive Implementation of TRMM and TRSM
by: Carrica, Vicki, et al.
Published: (2025)

Portable GPU implementation of the WP-CCC ion-atom collisions code
by: Abdurakhmanov, I. B., et al.
Published: (2024)

Accelerating Loading WebGraphs in ParaGrapher
by: Esfahani, Mohsen Koohi
Published: (2025)

ParvaGPU: Efficient Spatial GPU Sharing for Large-Scale DNN Inference in Cloud Environments
by: Lee, Munkyu, et al.
Published: (2024)

Byzantine-Tolerant Consensus in GPU-Inspired Shared Memory
by: Georgiou, Chryssis, et al.
Published: (2025)

Experimental Analysis of Server-Side Caching for Web Performance
by: Umar, Mohammad, et al.
Published: (2026)

Anywhere: A Web Crawler Automation Management Interface
by: Lin, Jinwei
Published: (2024)

PilotANN: Memory-Bounded GPU Acceleration for Vector Search
by: Gui, Yuntao, et al.
Published: (2025)

Implementing Multi-GPU Scientific Computing Miniapps Across Performance Portable Frameworks
by: Villalobos, Johansell, et al.
Published: (2025)

HAS-GPU: Efficient Hybrid Auto-scaling with Fine-grained GPU Allocation for SLO-aware Serverless Inferences
by: Gu, Jianfeng, et al.
Published: (2025)

Mind the Memory Gap: Unveiling GPU Bottlenecks in Large-Batch LLM Inference
by: Recasens, Pol G., et al.
Published: (2025)

Towards Portability at Scale: A Cross-Architecture Performance Evaluation of a GPU-enabled Shallow Water Solver
by: Villalobos, Johansell, et al.
Published: (2025)

GMLake: Efficient and Transparent GPU Memory Defragmentation for Large-scale DNN Training with Virtual Memory Stitching
by: Guo, Cong, et al.
Published: (2024)

Multi-Partner Project: Multi-GPU Performance Portability Analysis for CFD Simulations at Scale
by: Eleftherakis, Panagiotis-Eleftherios, et al.
Published: (2026)

PAL: A Variability-Aware Policy for Scheduling ML Workloads in GPU Clusters
by: Jain, Rutwik, et al.
Published: (2024)

PipeMax: Enhancing Offline LLM Inference on Commodity GPU Servers
by: Zhang, Hongbin, et al.
Published: (2026)

GPU Memory and Utilization Estimation for Training-Aware Resource Management: Opportunities and Limitations
by: Yousefzadeh-Asl-Miandoab, Ehsan, et al.
Published: (2026)

AQUA: Network-Accelerated Memory Offloading for LLMs in Scale-Up GPU Domains
by: Kumar, Abhishek Vijaya, et al.
Published: (2024)

Roadrunner: Accelerating Data Delivery to WebAssembly-Based Serverless Functions
by: Marcelino, Cynthia, et al.
Published: (2025)

Torpor: GPU-Enabled Serverless Computing for Low-Latency, Resource-Efficient Inference
by: Yu, Minchen, et al.
Published: (2023)

Loki: A System for Serving ML Inference Pipelines with Hardware and Accuracy Scaling
by: Ahmad, Sohaib, et al.
Published: (2024)

GPU-accelerated Multi-relational Parallel Graph Retrieval for Web-scale Recommendations
by: Guo, Zhuoning, et al.
Published: (2025)

Frenzy: A Memory-Aware Serverless LLM Training System for Heterogeneous GPU Clusters
by: Chang, Zihan, et al.
Published: (2024)

KIS-S: A GPU-Aware Kubernetes Inference Simulator with RL-Based Auto-Scaling
by: Zhang, Guilin, et al.
Published: (2025)