Staff View: An Engineering Journey Training Large Language Models at Scale on Alps: The Apertus Experience :: Library Catalog

Bibliographic Details
Main Authors:	Coles, Jonathan; Schuppli, Stefano; Drescher, Lukas; Mohamed, Fawzi Roberto; Palme, Elia; Mendonça, Henrique; Gila, Miguel; Klein, Mark; Martinasso, Maxime; VandeVondele, Joost; Hoefler, Torsten; Schulthess, Thomas; Romero, Josh; Gorodetsky, Igor; Hankins, Ryan; Wazirzada, Isa; Jaggi, Martin; Bosselut, Antoine; Schlag, Imanol; Llaquet, Antoni-Joan Solergibert i; Cano, Alejandro Hernández; Manitaras, Theofilos Ioannis; Browning, Nicholas John
Format:	Preprint
Published:	2026
Subjects:	Distributed, Parallel, and Cluster Computing
Online Access:	https://arxiv.org/abs/2604.12973

_version_	1866911595191336960
author	Coles, Jonathan; Schuppli, Stefano; Drescher, Lukas; Mohamed, Fawzi Roberto; Palme, Elia; Mendonça, Henrique; Gila, Miguel; Klein, Mark; Martinasso, Maxime; VandeVondele, Joost; Hoefler, Torsten; Schulthess, Thomas; Romero, Josh; Gorodetsky, Igor; Hankins, Ryan; Wazirzada, Isa; Jaggi, Martin; Bosselut, Antoine; Schlag, Imanol; Llaquet, Antoni-Joan Solergibert i; Cano, Alejandro Hernández; Manitaras, Theofilos Ioannis; Browning, Nicholas John
author_facet	Coles, Jonathan; Schuppli, Stefano; Drescher, Lukas; Mohamed, Fawzi Roberto; Palme, Elia; Mendonça, Henrique; Gila, Miguel; Klein, Mark; Martinasso, Maxime; VandeVondele, Joost; Hoefler, Torsten; Schulthess, Thomas; Romero, Josh; Gorodetsky, Igor; Hankins, Ryan; Wazirzada, Isa; Jaggi, Martin; Bosselut, Antoine; Schlag, Imanol; Llaquet, Antoni-Joan Solergibert i; Cano, Alejandro Hernández; Manitaras, Theofilos Ioannis; Browning, Nicholas John
contents	Large Language Models (LLMs) have surged as a transformative technology for science and society, prompting governments worldwide to pursue sovereign AI capabilities that ensure data compliance and cultural representation. However, the associated capital costs and engineering complexity required to train these models have largely restricted such capabilities to the private sector, leaving a significant gap for public institutions. This paper details the engineering journey behind training Apertus, a fully open multilingual foundation model, on the Alps supercomputer. Representing a first-of-its-kind achievement for academia at the 70B parameter scale, we successfully deployed a massive pre-training campaign on one of Europe's largest systems for open science, powered by NVIDIA GH200 Grace Hopper Superchips. We detail the challenges encountered in readying HPC infrastructure for training AI models, from overcoming storage bottlenecks to stabilizing large-scale interconnects, and the lessons learned in transforming a supercomputer into a resilient software-defined Machine Learning Platform. Finally, we discuss the post-training requirements and evolution of our Machine Learning platform, outlining how this initial release lays the groundwork for a sustained, iterative operational capability, in particular for fine tuning foundation models, that extends well beyond a single model training run.
format	Preprint
id	arxiv_https___arxiv_org_abs_2604_12973
institution	arXiv
publishDate	2026
record_format	arxiv
title	An Engineering Journey Training Large Language Models at Scale on Alps: The Apertus Experience
topic	Distributed, Parallel, and Cluster Computing
url	https://arxiv.org/abs/2604.12973

Similar Items