:: Library Catalog

Bibliographic Details
Main Authors:	Stubbs, Joe; Padhy, Smruti; Cardone, Richard
Format:	Preprint
Published:	2024
Subjects:	Performance; Distributed, Parallel, and Cluster Computing; Machine Learning
Online Access:	https://arxiv.org/abs/2408.03349

Similar Items

GPU Cluster Scheduling for Network-Sensitive Deep Learning
by: Sharma, Aakash, et al.
Published: (2024)

Is Intelligence the Right Direction in New OS Scheduling for Multiple Resources in Cloud Environments?
by: Dou, Xinglei, et al.
Published: (2025)

Towards Universal Performance Modeling for Machine Learning Training on Multi-GPU Platforms
by: Lin, Zhongyi, et al.
Published: (2024)

Prompt-Aware Scheduling for Low-Latency LLM Serving
by: Tao, Yiheng, et al.
Published: (2025)

Agentic Auto-Scheduling: An Experimental Study of LLM-Guided Loop Optimization
by: Merouani, Massinissa, et al.
Published: (2025)

A Comparative Study of OpenMP Scheduling Algorithm Selection Strategies
by: Korndörfer, Jonas H. Müller, et al.
Published: (2025)

KVDirect: Distributed Disaggregated LLM Inference
by: Chen, Shiyang, et al.
Published: (2024)

KVPR: Efficient LLM Inference with I/O-Aware KV Cache Partial Recomputation
by: Jiang, Chaoyi, et al.
Published: (2024)

When Less is More: Achieving Faster Convergence in Distributed Edge Machine Learning
by: Basani, Advik Raj, et al.
Published: (2024)

iSpLib: A Library for Accelerating Graph Neural Networks using Auto-tuned Sparse Operations
by: Anik, Md Saidul Hoque, et al.
Published: (2024)

cedar: Optimized and Unified Machine Learning Input Data Pipelines
by: Zhao, Mark, et al.
Published: (2024)

MIREncoder: Multi-modal IR-based Pretrained Embeddings for Performance Optimizations
by: Dutta, Akash, et al.
Published: (2024)

AutoChunk: Automated Activation Chunk for Memory-Efficient Long Sequence Inference
by: Zhao, Xuanlei, et al.
Published: (2024)

MassiveGNN: Efficient Training via Prefetching for Massively Connected Distributed Graphs
by: Sarkar, Aishwarya, et al.
Published: (2024)

Less is More: Optimizing Function Calling for LLM Execution on Edge Devices
by: Paramanayakam, Varatheepan, et al.
Published: (2024)

BestServe: Serving Strategies with Optimal Goodput in Collocation and Disaggregation Architectures
by: Hu, Xiannan, et al.
Published: (2025)

Fake Runs, Real Fixes -- Analyzing xPU Performance Through Simulation
by: Zarkadas, Ioannis, et al.
Published: (2025)

InkStream: Real-time GNN Inference on Streaming Graphs via Incremental Update
by: Wu, Dan, et al.
Published: (2023)

Vectorized FlashAttention with Low-cost Exponential Computation in RISC-V Vector Processors
by: Titopoulos, Vasileios, et al.
Published: (2025)

TaxBreak: Unmasking the Hidden Costs of LLM Inference Through Overhead Decomposition
by: Vellaisamy, Prabhu, et al.
Published: (2026)

DeepCQ: General-Purpose Deep-Surrogate Framework for Lossy Compression Quality Prediction
by: Mumenin, Khondoker Mirazul, et al.
Published: (2025)

xMem: A CPU-Based Approach for Accurate Estimation of GPU Memory in Deep Learning Training Workloads
by: Shi, Jiabo, et al.
Published: (2025)

IPA: Inference Pipeline Adaptation to Achieve High Accuracy and Cost-Efficiency
by: Ghafouri, Saeid, et al.
Published: (2023)

CoFormer: Collaborating with Heterogeneous Edge Devices for Scalable Transformer Inference
by: Xu, Guanyu, et al.
Published: (2025)

Accelerating Mobile Inference through Fine-Grained CPU-GPU Co-Execution
by: Li, Zhuojin, et al.
Published: (2025)

Multi-DNN Inference of Sparse Models on Edge SoCs
by: Luo, Jiawei, et al.
Published: (2026)

MLCommons Chakra: Advancing Performance Benchmarking and Co-design using Standardized Execution Traces
by: Sridharan, Srinivas, et al.
Published: (2026)

AutoSP: Unlocking Long-Context LLM Training Via Compiler-Based Sequence Parallelism
by: Gupta, Ahan, et al.
Published: (2026)

ReLATE: Learning Efficient Sparse Encoding for High-Performance Tensor Decomposition
by: Helal, Ahmed E., et al.
Published: (2025)

Execution time budget assignment for mixed criticality systems
by: Khelassi, Mohamed Amine, et al.
Published: (2023)

Ecomap: Sustainability-Driven Optimization of Multi-Tenant DNN Execution on Edge Servers
by: Paramanayakam, Varatheepan, et al.
Published: (2025)

Distributed Matrix-Based Sampling for Graph Neural Network Training
by: Tripathy, Alok, et al.
Published: (2023)

CloudFormer: An Attention-based Performance Prediction for Public Clouds with Unknown Workload
by: Shahbazinia, Amirhossein, et al.
Published: (2025)

Glinthawk: A Two-Tiered Architecture for Offline LLM Inference
by: Hamadanian, Pouya, et al.
Published: (2025)

CARMA: Collocation-Aware Resource Manager
by: Yousefzadeh-Asl-Miandoab, Ehsan, et al.
Published: (2025)

Ariel-ML: Computing Parallelization with Embedded Rust for Neural Networks on Heterogeneous Multi-core Microcontrollers
by: Huang, Zhaolan, et al.
Published: (2025)

You Don't Need All Attentions: Distributed Dynamic Fine-Tuning for Foundation Models
by: Ding, Shiwei, et al.
Published: (2025)

Tuning the Tuner: Introducing Hyperparameter Optimization for Auto-Tuning
by: Willemsen, Floris-Jan, et al.
Published: (2025)

A Practical Two-Stage Framework for GPU Resource and Power Prediction in Heterogeneous HPC Systems
by: Oztop, Beste, et al.
Published: (2026)

Characterizing WebGPU Dispatch Overhead for LLM Inference Across Four GPU Vendors, Three Backends, and Three Browsers
by: Maczan, Jędrzej
Published: (2026)