MARC21: On the Weight Dynamics of Deep Normalized Networks :: Library Catalog

Bibliographic Details
Main Authors:	Mehmeti-Göpel, Christian H. X. Ali; Wand, Michael
Format:	Preprint
Published:	2023
Subjects:	Machine Learning; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2306.00700

_version_	1866913360043311104
author	Mehmeti-Göpel, Christian H. X. Ali; Wand, Michael
author_facet	Mehmeti-Göpel, Christian H. X. Ali; Wand, Michael
contents	Recent studies have shown that high disparities in effective learning rates (ELRs) across layers in deep neural networks can negatively affect trainability. We formalize how these disparities evolve over time by modeling weight dynamics (evolution of expected gradient and weight norms) of networks with normalization layers, predicting the evolution of layer-wise ELR ratios. We prove that when training with any constant learning rate, ELR ratios converge to 1, despite initial gradient explosion. We identify a "critical learning rate" beyond which ELR disparities widen, which only depends on current ELRs. To validate our findings, we devise a hyper-parameter-free warm-up method that successfully minimizes ELR spread quickly in theory and practice. Our experiments link ELR spread with trainability, a relationship that is most evident in very deep networks with significant gradient magnitude excursions.
format	Preprint
id	arxiv_https___arxiv_org_abs_2306_00700
institution	arXiv
publishDate	2023
record_format	arxiv
title	On the Weight Dynamics of Deep Normalized Networks
topic	Machine Learning; Artificial Intelligence
url	https://arxiv.org/abs/2306.00700