| Main Authors: | , |
|---|---|
| Format: | Digital resource |
| Language: | |
| Published: | Zenodo, 2025 |
| Online Access: | https://doi.org/10.5281/zenodo.16696258 |
Table of Contents:
- <p>In 2023, we introduced the <a href="https://github.com/PrincetonUniversity/jobstats">Jobstats</a> job monitoring platform, which provides user-facing commands and interfaces for inspecting the efficiency of Slurm jobs on CPU and GPU clusters. The platform builds on the Prometheus monitoring framework and the Grafana visualization toolkit, and it has been adopted by tens of institutions throughout the world. In this poster, we provide updates on the platform, including the release of a new component for mitigating underutilization. <a href="https://github.com/PrincetonUniversity/job_defense_shield">Job Defense Shield</a> is a software tool for identifying (or even automatically cancelling) user jobs that are underutilizing high-performance computing resources such as GPUs. Users are sent automated email alerts, while system administrators can view reports. Job Defense Shield thus serves as a tool for both job monitoring and user training.</p>