Staff View: u-$μ$P: The Unit-Scaled Maximal Update Parametrization :: Library Catalog

Bibliographic Details
Main Authors:	Blake, Charlie; Eichenberg, Constantin; Dean, Josef; Balles, Lukas; Prince, Luke Y.; Deiseroth, Björn; Cruz-Salinas, Andres Felipe; Luschi, Carlo; Weinbach, Samuel; Orr, Douglas
Format:	Preprint
Published:	2024
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2407.17465

_version_	1866909453408796672
author	Blake, Charlie; Eichenberg, Constantin; Dean, Josef; Balles, Lukas; Prince, Luke Y.; Deiseroth, Björn; Cruz-Salinas, Andres Felipe; Luschi, Carlo; Weinbach, Samuel; Orr, Douglas
author_facet	Blake, Charlie; Eichenberg, Constantin; Dean, Josef; Balles, Lukas; Prince, Luke Y.; Deiseroth, Björn; Cruz-Salinas, Andres Felipe; Luschi, Carlo; Weinbach, Samuel; Orr, Douglas
contents	The Maximal Update Parametrization ($μ$P) aims to make the optimal hyperparameters (HPs) of a model independent of its size, allowing them to be swept using a cheap proxy model rather than the full-size target model. We present a new scheme, u-$μ$P, which improves upon $μ$P by combining it with Unit Scaling, a method for designing models that makes them easy to train in low-precision. The two techniques have a natural affinity: $μ$P ensures that the scale of activations is independent of model size, and Unit Scaling ensures that activations, weights and gradients begin training with a scale of one. This synthesis opens the door to a simpler scheme, whose default values are near-optimal. This in turn facilitates a more efficient sweeping strategy, with u-$μ$P models reaching a loss that is equal to or lower than comparable $μ$P models and working out-of-the-box in FP8.
format	Preprint
id	arxiv_https___arxiv_org_abs_2407_17465
institution	arXiv
publishDate	2024
record_format	arxiv
title	u-$μ$P: The Unit-Scaled Maximal Update Parametrization
topic	Machine Learning
url	https://arxiv.org/abs/2407.17465

Similar Items