Bibliographic Details
Main Authors: Rae, Christopher; Lee, Joseph K. L.; Richings, James; Weiland, Michele
Format: Preprint
Published: 2024
Online Access: https://arxiv.org/abs/2404.10536
Table of Contents:
  • With the rapid increase in machine learning workloads on HPC systems, it is beneficial to run machine-learning-specific benchmarks regularly to monitor performance and identify issues. As part of the Edinburgh International Data Facility, EPCC currently hosts a wide range of machine learning accelerators, including NVIDIA GPUs, the Graphcore Bow Pod64, and the Cerebras CS-2, which are managed via Kubernetes and Slurm. We extended the ReFrame framework to support the Kubernetes scheduler backend and use ReFrame to run machine learning benchmarks. We discuss the preliminary results collected and the challenges involved in integrating ReFrame across multiple platforms and architectures.