Bibliographic Details
Main Authors: Halverson, Jonathan; Plazonic, Josko
Format: Digital resource
Language:
Published: Zenodo 2025
Online Access: https://doi.org/10.5281/zenodo.16696258
Table of Contents:
  • In 2023, we introduced the Jobstats job monitoring platform (https://github.com/PrincetonUniversity/jobstats), which provides user-facing commands and interfaces for inspecting the efficiency of Slurm jobs on CPU and GPU clusters. The platform builds on the Prometheus monitoring framework and the Grafana visualization toolkit, and it has been adopted by tens of institutions throughout the world. In this poster, we provide updates on the platform, including the release of a new component for mitigating underutilization. Job Defense Shield (https://github.com/PrincetonUniversity/job_defense_shield) is a software tool for identifying (or even automatically cancelling) user jobs that are underutilizing high-performance computing resources such as GPUs. Users are sent automated email alerts, while system administrators can view reports. Job Defense Shield is a tool for both job monitoring and user training.