Bibliographic Details
Main Authors: Schmid, Larissa, Horzela, Maximilian, Zhyla, Valerii, Giffels, Manuel, Quast, Günter, Koziolek, Anne
Format: Preprint
Published: 2025
Subjects: Distributed, Parallel, and Cluster Computing; Performance; High Energy Physics - Experiment
Online Access: https://arxiv.org/abs/2502.12741
author Schmid, Larissa
Horzela, Maximilian
Zhyla, Valerii
Giffels, Manuel
Quast, Günter
Koziolek, Anne
contents The Worldwide LHC Computing Grid (WLCG) provides the robust computing infrastructure essential for the LHC experiments by integrating global computing resources into a cohesive entity. Simulating different compute models is a feasible way to evaluate future adaptations that can cope with increased demands. However, running these simulations incurs a trade-off between accuracy and scalability. For example, while the simulator DCSim can provide accurate results, it does not scale well with the size of the simulated platform. Using generative machine learning as a surrogate is a candidate for overcoming this challenge. In this work, we evaluate three different machine learning models for the simulation of distributed computing systems and assess their ability to generalize to unseen situations. We show that these models can predict central observables derived from execution traces of compute jobs with approximate accuracy but orders of magnitude faster. Furthermore, we identify opportunities for improving the accuracy and generalizability of the predictions.
format Preprint
id arxiv_https___arxiv_org_abs_2502_12741
institution arXiv
publishDate 2025
record_format arxiv
title Surrogate Modeling for Scalable Evaluation of Distributed Computing Systems for HEP Applications
topic Distributed, Parallel, and Cluster Computing
Performance
High Energy Physics - Experiment
url https://arxiv.org/abs/2502.12741