:: Library Catalog

Bibliographic Details
Main Authors:	Cionca, Victor; Szabo, Ferenc; Vasilev, Stanimir; Smyth, Dylan
Format:	Preprint
Published:	2026
Subjects:	Distributed, Parallel, and Cluster Computing; Performance
Online Access:	https://arxiv.org/abs/2604.04498

Similar Items

Preliminary report: Initial evaluation of StdPar implementations on AMD GPUs for HPC
by: Lin, Wei-Chen, et al.
Published: (2024)

Hardware-Agnostic and Insightful Efficiency Metrics for Accelerated Systems: Definition and Implementation within TALP
by: Rahimi, Ghazal, et al.
Published: (2026)

Cloud Performance Decomposition for Long-Term Performance Engineering: A Case Study
by: Debnath, Shimul, et al.
Published: (2026)

Collaborative Processing for Multi-Tenant Inference on Memory-Constrained Edge TPUs
by: Ng, Nathan, et al.
Published: (2026)

Serving Chain-structured Jobs with Large Memory Footprints with Application to Large Foundation Model Serving
by: Sun, Tingyang, et al.
Published: (2026)

Minos: Systematically Classifying Performance and Power Characteristics of GPU Workloads on HPC Clusters
by: Jain, Rutwik, et al.
Published: (2026)

Beyond Thread States: Diagnosing Performance Degradation with eBPF and Thread Dynamics
by: Landau, Diogo, et al.
Published: (2026)

QoSFlow: Ensuring Service Quality of Distributed Workflows Using Interpretable Sensitivity Models
by: Rashid, Md Hasanur, et al.
Published: (2026)

High-Performance Portable GPU Primitives for Arbitrary Types and Operators in Julia
by: Pilliat, Emmanuel
Published: (2026)

Understanding Inference Scaling for LLMs: Bottlenecks, Trade-offs, and Performance Principles
by: Arif, Moiz, et al.
Published: (2026)

Performance Optimization in Stream Processing Systems: Experiment-Driven Configuration Tuning for Kafka Streams
by: Chen, David, et al.
Published: (2026)

PASTA: A Modular Program Analysis Tool Framework for Accelerators
by: Lin, Mao, et al.
Published: (2026)

CARAT: Client-Side Adaptive RPC and Cache Co-Tuning for Parallel File Systems
by: Rashid, Md Hasanur, et al.
Published: (2026)

The Energy Cost of Execution-Idle in GPU Clusters
by: Lei, Yiran, et al.
Published: (2026)

ADELIA: Automatic Differentiation for Efficient Laplace Inference Approximations
by: Boudaoud, Afif, et al.
Published: (2026)

Modeling the Impact of Fiber Latency on Compute-Communication Overlap in Geo-Distributed Multi-Datacenter AI Training
by: Papavasileiou, Ioannis, et al.
Published: (2026)

Enhancing Performance Insight at Scale: A Heterogeneous Framework for Exascale Diagnostics
by: Grbic, Dragana
Published: (2026)

Learning-Augmented Performance Model for Tensor Product Factorization in High-Order FEM
by: Ren, Xuanzhengbo, et al.
Published: (2026)

LEO: Tracing GPU Stall Root Causes via Cross-Vendor Backward Slicing
by: Xia, Yuning, et al.
Published: (2026)

DataStates-LLM: Scalable Checkpointing for Transformer Models Using Composable State Providers
by: Maurya, Avinash, et al.
Published: (2026)

DIAL: Decentralized I/O AutoTuning via Learned Client-side Local Metrics for Parallel File System
by: Rashid, Md Hasanur, et al.
Published: (2026)

HybridGen: Efficient LLM Generative Inference via CPU-GPU Hybrid Computing
by: Lin, Mao, et al.
Published: (2026)

A Multi-Port Concurrent Communication Model for handling Compute Intensive Tasks on Distributed Satellite System Constellations
by: Veeravalli, Bharadwaj
Published: (2026)

Operational Strategies for Non-Disruptive Scheduling Transitions in Production HPC Systems
by: MacLachlan, Glen, et al.
Published: (2026)

FACT: Compositional Kernel Synthesis with a Three-Stage Agentic Workflow
by: Heidari, Sina, et al.
Published: (2026)

KEET: Explaining Performance of GPU Kernels Using LLM Agents
by: Davis, Joshua H., et al.
Published: (2026)

Shifting the Sweet Spot: High-Performance Matrix-Free Method for High-Order Elasticity
by: Chang, Dali, et al.
Published: (2026)

Comparing the Performance of Heterogeneous Conjugate Gradient and Cholesky Solvers on Various Hardware Using SYCL
by: Thüring, Tim, et al.
Published: (2026)

Energy-Aware Computing in the Year 2026
by: Tchakoute, Roblex Nana, et al.
Published: (2026)

Communication-Aware Diffusion Load Balancing for Persistently Interacting Objects
by: Taylor, Maya, et al.
Published: (2026)

Extracting Practical, Actionable Energy Insights from Supercomputer Telemetry and Logs
by: Cornelius, Melanie, et al.
Published: (2025)

Scaling Large-scale GNN Training to Thousands of Processors on CPU-based Supercomputers
by: Zhuang, Chen, et al.
Published: (2024)

Profiling and optimization of multi-card GPU machine learning jobs
by: Lawenda, Marcin, et al.
Published: (2025)

Optimal Parallel Scheduling under Concave Speedup Functions
by: Li, Chengzhang, et al.
Published: (2025)

WebAssembly and Unikernels: A Comparative Study for Serverless at the Edge
by: Besozzi, Valerio, et al.
Published: (2025)

Efficient GPU-Centered Singular Value Decomposition Using the Divide-and-Conquer Method
by: Liu, Shifang, et al.
Published: (2025)

Resource Management Schemes for Cloud-Native Platforms with Computing Containers of Docker and Kubernetes
by: Mao, Ying, et al.
Published: (2020)

Staging Blocked Evaluation over Structured Sparse Matrices
by: Das, Pratyush, et al.
Published: (2024)

Reducing Tail Latencies Through Environment- and Neighbour-aware Thread Management
by: Jeffery, Andrew, et al.
Published: (2024)

Dissecting the software-based measurement of CPU energy consumption: a comparative analysis
by: Raffin, Guillaume, et al.
Published: (2024)