Bibliographic Details
Main Authors: Mlodozeniec, Bruno; Ablin, Pierre; Béthune, Louis; Busbridge, Dan; Klein, Michal; Ramapuram, Jason; Cuturi, Marco
Format: Preprint
Published: 2025
Subjects: Machine Learning, Artificial Intelligence
Online Access: https://arxiv.org/abs/2512.22382
_version_ 1866908734229315584
author Mlodozeniec, Bruno
Ablin, Pierre
Béthune, Louis
Busbridge, Dan
Klein, Michal
Ramapuram, Jason
Cuturi, Marco
contents Hyperparameter tuning can dramatically impact the training stability and final performance of large-scale models. Recent works on neural network parameterisations, such as $μ$P, have enabled the transfer of optimal global hyperparameters across model sizes. These works propose an empirical practice of searching for optimal global base hyperparameters at a small model size and transferring them to a large size. We extend these works in two key ways. First, to handle scaling along the most important scaling axes, we propose the Complete$^{(d)}$ Parameterisation, which unifies scaling in width and depth (using an adaptation of CompleteP) as well as in batch size and training duration. Second, with our parameterisation, we investigate per-module hyperparameter optimisation and transfer. We characterise the empirical challenges of navigating the high-dimensional hyperparameter landscape and propose practical guidelines for tackling this optimisation problem. We demonstrate that, with the right parameterisation, hyperparameter transfer holds even in the per-module hyperparameter regime. Our study covers an extensive range of optimisation hyperparameters of modern models: learning rates, AdamW parameters, weight decay, initialisation scales, and residual block multipliers. Our experiments demonstrate significant training speed improvements in Large Language Models with the transferred per-module hyperparameters.
format Preprint
id arxiv_https___arxiv_org_abs_2512_22382
institution arXiv
publishDate 2025
record_format arxiv
title Completed Hyperparameter Transfer across Modules, Width, Depth, Batch and Duration
topic Machine Learning
Artificial Intelligence
url https://arxiv.org/abs/2512.22382