Bibliographic Details
Main Authors: Schuppli, Stefano; Mohamed, Fawzi; Mendonça, Henrique; Mujkanovic, Nina; Palme, Elia; Conciatore, Dino; Drescher, Lukas; Gila, Miguel; Witlox, Pim; VandeVondele, Joost; Martinasso, Maxime; Schulthess, Thomas C.; Hoefler, Torsten
Format: Preprint
Published: 2025
Subjects: Distributed, Parallel, and Cluster Computing; Machine Learning
Online Access: https://arxiv.org/abs/2507.01880
author Schuppli, Stefano
Mohamed, Fawzi
Mendonça, Henrique
Mujkanovic, Nina
Palme, Elia
Conciatore, Dino
Drescher, Lukas
Gila, Miguel
Witlox, Pim
VandeVondele, Joost
Martinasso, Maxime
Schulthess, Thomas C.
Hoefler, Torsten
contents The Alps Research Infrastructure leverages GH200 technology at scale, featuring 10,752 GPUs. Accessing Alps provides a significant computational advantage for researchers in Artificial Intelligence (AI) and Machine Learning (ML). While Alps serves a broad range of scientific communities, traditional HPC services alone are not sufficient to meet the dynamic needs of the ML community. This paper presents an initial investigation into extending HPC service capabilities to better support ML workloads. We identify key challenges and gaps that we have observed since the Swiss AI community's early-access phase on Alps (2023), and propose several technological enhancements. These include a user environment designed to facilitate the adoption of HPC for ML workloads, balancing performance with flexibility; a utility for rapid performance screening of ML applications during development; observability capabilities and data products for inspecting ongoing large-scale ML workloads; a utility to simplify the vetting of allocated nodes for compute readiness; a service plane infrastructure to deploy various types of workloads, including support and inference services; and a storage infrastructure tailored to the specific needs of ML workloads. These enhancements aim to facilitate the execution of ML workloads on HPC systems, increase system usability and resilience, and better align with the needs of the ML community. We also discuss our current approach to security aspects. This paper concludes by placing these proposals in the broader context of changes in the communities served by HPC infrastructure like ours.
format Preprint
id arxiv_https___arxiv_org_abs_2507_01880
institution arXiv
publishDate 2025
record_format arxiv
title Evolving HPC services to enable ML workloads on HPE Cray EX
topic Distributed, Parallel, and Cluster Computing
Machine Learning
url https://arxiv.org/abs/2507.01880