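| Main Authors: | Blake, Charlie; Eichenberg, Constantin; Dean, Josef; Balles, Lukas; Prince, Luke Y.; Deiseroth, Björn; Cruz-Salinas, Andres Felipe; Luschi, Carlo; Weinbach, Samuel; Orr, Douglas |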
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | Machine Learning |
| Online Access: | https://arxiv.org/abs/2407.17465 |

| _version_ | 1866909453408796672 |
|---|---|
| author | Blake, Charlie; Eichenberg, Constantin; Dean, Josef; Balles, Lukas; Prince, Luke Y.; Deiseroth, Björn; Cruz-Salinas, Andres Felipe; Luschi, Carlo; Weinbach, Samuel; Orr, Douglas |
| author_facet | Blake, Charlie; Eichenberg, Constantin; Dean, Josef; Balles, Lukas; Prince, Luke Y.; Deiseroth, Björn; Cruz-Salinas, Andres Felipe; Luschi, Carlo; Weinbach, Samuel; Orr, Douglas |
| contents | The Maximal Update Parametrization ($\mu$P) aims to make the optimal hyperparameters (HPs) of a model independent of its size, allowing them to be swept using a cheap proxy model rather than the full-size target model. We present a new scheme, u-$\mu$P, which improves upon $\mu$P by combining it with Unit Scaling, a method for designing models that makes them easy to train in low precision. The two techniques have a natural affinity: $\mu$P ensures that the scale of activations is independent of model size, and Unit Scaling ensures that activations, weights and gradients begin training with a scale of one. This synthesis opens the door to a simpler scheme whose default values are near-optimal. This in turn facilitates a more efficient sweeping strategy, with u-$\mu$P models reaching a loss that is equal to or lower than that of comparable $\mu$P models and working out of the box in FP8. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2407_17465 |
| institution | arXiv |
| publishDate | 2024 |
| record_format | arxiv |
| title | u-$\mu$P: The Unit-Scaled Maximal Update Parametrization |
| topic | Machine Learning |
| url | https://arxiv.org/abs/2407.17465 |