Staff View: Completed Hyperparameter Transfer across Modules, Width, Depth, Batch and Duration :: Library Catalog

Bibliographic Details
Main Authors:	Mlodozeniec, Bruno; Ablin, Pierre; Béthune, Louis; Busbridge, Dan; Klein, Michal; Ramapuram, Jason; Cuturi, Marco
Format:	Preprint
Published:	2025
Subjects:	Machine Learning; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2512.22382

_version_	1866908734229315584
author	Mlodozeniec, Bruno; Ablin, Pierre; Béthune, Louis; Busbridge, Dan; Klein, Michal; Ramapuram, Jason; Cuturi, Marco
author_facet	Mlodozeniec, Bruno; Ablin, Pierre; Béthune, Louis; Busbridge, Dan; Klein, Michal; Ramapuram, Jason; Cuturi, Marco
contents	Hyperparameter tuning can dramatically impact the training stability and final performance of large-scale models. Recent works on neural network parameterisations, such as $μ$P, have enabled the transfer of optimal global hyperparameters across model sizes. These works propose the empirical practice of searching for optimal global base hyperparameters at a small model size and transferring them to a large one. We extend these works in two key ways. First, to handle scaling along the most important axes, we propose the Complete$^{(d)}$ Parameterisation, which unifies scaling in width and depth -- using an adaptation of CompleteP -- as well as in batch size and training duration. Second, with our parameterisation, we investigate per-module hyperparameter optimisation and transfer. We characterise the empirical challenges of navigating the high-dimensional hyperparameter landscape and propose practical guidelines for tackling this optimisation problem. We demonstrate that, with the right parameterisation, hyperparameter transfer holds even in the per-module hyperparameter regime. Our study covers an extensive range of optimisation hyperparameters of modern models: learning rates, AdamW parameters, weight decay, initialisation scales, and residual block multipliers. Our experiments demonstrate significant training speed improvements in Large Language Models with the transferred per-module hyperparameters.
format	Preprint
id	arxiv_https___arxiv_org_abs_2512_22382
institution	arXiv
publishDate	2025
record_format	arxiv
title	Completed Hyperparameter Transfer across Modules, Width, Depth, Batch and Duration
topic	Machine Learning; Artificial Intelligence
url	https://arxiv.org/abs/2512.22382

Similar Items