Bibliographic Details
Main Authors: Blake, Charlie; Eichenberg, Constantin; Dean, Josef; Balles, Lukas; Prince, Luke Y.; Deiseroth, Björn; Cruz-Salinas, Andres Felipe; Luschi, Carlo; Weinbach, Samuel; Orr, Douglas
Format: Preprint
Published: 2024
Subjects: Machine Learning
Online Access: https://arxiv.org/abs/2407.17465
author Blake, Charlie
Eichenberg, Constantin
Dean, Josef
Balles, Lukas
Prince, Luke Y.
Deiseroth, Björn
Cruz-Salinas, Andres Felipe
Luschi, Carlo
Weinbach, Samuel
Orr, Douglas
contents The Maximal Update Parametrization ($μ$P) aims to make the optimal hyperparameters (HPs) of a model independent of its size, allowing them to be swept using a cheap proxy model rather than the full-size target model. We present a new scheme, u-$μ$P, which improves upon $μ$P by combining it with Unit Scaling, a method for designing models that makes them easy to train in low precision. The two techniques have a natural affinity: $μ$P ensures that the scale of activations is independent of model size, and Unit Scaling ensures that activations, weights and gradients begin training with a scale of one. This synthesis opens the door to a simpler scheme whose default values are near-optimal. This in turn facilitates a more efficient sweeping strategy, with u-$μ$P models reaching a loss that is equal to or lower than that of comparable $μ$P models and working out-of-the-box in FP8.
format Preprint
id arxiv_https___arxiv_org_abs_2407_17465
institution arXiv
publishDate 2024
record_format arxiv
title u-$μ$P: The Unit-Scaled Maximal Update Parametrization
topic Machine Learning
url https://arxiv.org/abs/2407.17465
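
The abstract's statement that activations "begin training with a scale of one", independent of model size, can be illustrated with a small sketch. It is not code from the paper and does not implement the u-$μ$P rules. It assumes unit-variance Gaussian inputs and weights and a fixed 1/sqrt(fan_in) multiplier on the output of a linear layer. The names `unit_scaled_linear` and `output_std` are hypothetical and chosen only for this illustration.

```python
import math
import random
import statistics


def unit_scaled_linear(x, weights):
    """Matrix-vector product followed by a fixed 1/sqrt(fan_in) scale.

    Assumption for this sketch: both x and the weights have unit variance,
    so the fixed multiplier keeps the output near unit variance as well.
    """
    fan_in = len(x)
    scale = 1.0 / math.sqrt(fan_in)
    return [scale * sum(xi * wi for xi, wi in zip(x, row)) for row in weights]


def output_std(fan_in, fan_out=256, seed=0):
    """Empirical standard deviation of the layer output at initialization."""
    rng = random.Random(seed)
    # Unit-variance input vector and weight matrix (hypothetical setup).
    x = [rng.gauss(0.0, 1.0) for _ in range(fan_in)]
    weights = [[rng.gauss(0.0, 1.0) for _ in range(fan_in)]
               for _ in range(fan_out)]
    return statistics.pstdev(unit_scaled_linear(x, weights))


if __name__ == "__main__":
    # The measured scale should stay close to one as the width grows.
    for width in (64, 256, 1024):
        print(width, round(output_std(width), 2))
```

Without the 1/sqrt(fan_in) multiplier the output scale would grow with the square root of the width, which is the kind of size dependence the record's abstract says the combined scheme avoids.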