| Main Authors: | Levine, Reese, Sharma, Rithik, Jain, Nikhil, Ramesh, Abhijit, Chen, Zheyuan, Abbas, Neha, Contini, James, Sorensen, Tyler |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Online Access: | https://arxiv.org/abs/2605.20706 |
Similar Items
Characterizing WebGPU Dispatch Overhead for LLM Inference Across Four GPU Vendors, Three Backends, and Three Browsers
by: Maczan, Jędrzej
Published: (2026)
LeftoverLocals: Listening to LLM Responses Through Leaked GPU Local Memory
by: Sorensen, Tyler, et al.
Published: (2024)
Efficient and Portable Support for Overdecomposition on Distributed Memory GPGPU Platforms
by: Bhosale, Aditya, et al.
Published: (2026)
KEET: Explaining Performance of GPU Kernels Using LLM Agents
by: Davis, Joshua H., et al.
Published: (2026)
Beyond Microservices: Testing Web-Scale RCA Methods on GPU-Driven LLM Workloads
by: Scheinert, Dominik, et al.
Published: (2026)
Performant Unified GPU Kernels for Portable Singular Value Computation Across Hardware and Precision
by: Ringoot, Evelyne, et al.
Published: (2025)
Taking GPU Programming Models to Task for Performance Portability
by: Davis, Joshua H., et al.
Published: (2024)
Mewz: Lightweight Execution Environment for WebAssembly with High Isolation and Portability using Unikernels
by: Ueda, Soichiro, et al.
Published: (2024)
DAK: Direct-Access-Enabled GPU Memory Offloading with Optimal Efficiency for LLM Inference
by: Lin, Shouxu, et al.
Published: (2026)
High-Performance Portable GPU Primitives for Arbitrary Types and Operators in Julia
by: Pilliat, Emmanuel
Published: (2026)
Efficient CPU-GPU Collaborative Inference for MoE-based LLMs on Memory-Limited Systems
by: Huang, En-Ming, et al.
Published: (2025)
HarMoEny: Efficient Multi-GPU Inference of MoE Models
by: Doucet, Zachary, et al.
Published: (2025)
Portability Efficiency Approach for Calculating Performance Portability
by: Marowka, Ami
Published: (2024)
Understanding the Landscape of Ampere GPU Memory Errors
by: Zhu, Zhu, et al.
Published: (2025)
Breaking the Memory Wall: A Study of I/O Patterns and GPU Memory Utilization for Hybrid CPU-GPU Offloaded Optimizers
by: Maurya, Avinash, et al.
Published: (2024)
Web3DB: Web 3.0 RDBMS for Individual Data Ownership
by: Mukherjee, Shankha Shubhra, et al.
Published: (2025)
Toward Portable GPU Performance: Julia Recursive Implementation of TRMM and TRSM
by: Carrica, Vicki, et al.
Published: (2025)
Portable GPU implementation of the WP-CCC ion-atom collisions code
by: Abdurakhmanov, I. B., et al.
Published: (2024)
Accelerating Loading WebGraphs in ParaGrapher
by: Esfahani, Mohsen Koohi
Published: (2025)
ParvaGPU: Efficient Spatial GPU Sharing for Large-Scale DNN Inference in Cloud Environments
by: Lee, Munkyu, et al.
Published: (2024)
Byzantine-Tolerant Consensus in GPU-Inspired Shared Memory
by: Georgiou, Chryssis, et al.
Published: (2025)
Experimental Analysis of Server-Side Caching for Web Performance
by: Umar, Mohammad, et al.
Published: (2026)
Anywhere: A Web Crawler Automation Management Interface
by: Lin, Jinwei
Published: (2024)
PilotANN: Memory-Bounded GPU Acceleration for Vector Search
by: Gui, Yuntao, et al.
Published: (2025)
Implementing Multi-GPU Scientific Computing Miniapps Across Performance Portable Frameworks
by: Villalobos, Johansell, et al.
Published: (2025)
HAS-GPU: Efficient Hybrid Auto-scaling with Fine-grained GPU Allocation for SLO-aware Serverless Inferences
by: Gu, Jianfeng, et al.
Published: (2025)
Mind the Memory Gap: Unveiling GPU Bottlenecks in Large-Batch LLM Inference
by: Recasens, Pol G., et al.
Published: (2025)
Towards Portability at Scale: A Cross-Architecture Performance Evaluation of a GPU-enabled Shallow Water Solver
by: Villalobos, Johansell, et al.
Published: (2025)
GMLake: Efficient and Transparent GPU Memory Defragmentation for Large-scale DNN Training with Virtual Memory Stitching
by: Guo, Cong, et al.
Published: (2024)
Multi-Partner Project: Multi-GPU Performance Portability Analysis for CFD Simulations at Scale
by: Eleftherakis, Panagiotis-Eleftherios, et al.
Published: (2026)
PAL: A Variability-Aware Policy for Scheduling ML Workloads in GPU Clusters
by: Jain, Rutwik, et al.
Published: (2024)
PipeMax: Enhancing Offline LLM Inference on Commodity GPU Servers
by: Zhang, Hongbin, et al.
Published: (2026)
GPU Memory and Utilization Estimation for Training-Aware Resource Management: Opportunities and Limitations
by: Yousefzadeh-Asl-Miandoab, Ehsan, et al.
Published: (2026)
AQUA: Network-Accelerated Memory Offloading for LLMs in Scale-Up GPU Domains
by: Kumar, Abhishek Vijaya, et al.
Published: (2024)
Roadrunner: Accelerating Data Delivery to WebAssembly-Based Serverless Functions
by: Marcelino, Cynthia, et al.
Published: (2025)
Torpor: GPU-Enabled Serverless Computing for Low-Latency, Resource-Efficient Inference
by: Yu, Minchen, et al.
Published: (2023)
Loki: A System for Serving ML Inference Pipelines with Hardware and Accuracy Scaling
by: Ahmad, Sohaib, et al.
Published: (2024)
GPU-accelerated Multi-relational Parallel Graph Retrieval for Web-scale Recommendations
by: Guo, Zhuoning, et al.
Published: (2025)
Frenzy: A Memory-Aware Serverless LLM Training System for Heterogeneous GPU Clusters
by: Chang, Zihan, et al.
Published: (2024)
KIS-S: A GPU-Aware Kubernetes Inference Simulator with RL-Based Auto-Scaling
by: Zhang, Guilin, et al.
Published: (2025)