Staff View: Adam Exploits $\ell_\infty$-geometry of Loss Landscape via Coordinate-wise Adaptivity :: Library Catalog

Bibliographic Details
Main Authors:	Xie, Shuo; Mohamadi, Mohamad Amin; Li, Zhiyuan
Format:	Preprint
Published:	2024
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2410.08198

_version_	1866912423252852736
author	Xie, Shuo; Mohamadi, Mohamad Amin; Li, Zhiyuan
author_facet	Xie, Shuo; Mohamadi, Mohamad Amin; Li, Zhiyuan
contents	Adam outperforms SGD when training language models. Yet this advantage is not well-understood theoretically -- previous convergence analysis for Adam and SGD mainly focuses on the number of steps $T$ and is already minimax-optimal in non-convex cases, which are both $\widetilde{O}(T^{-1/4})$. In this work, we argue that the exploitation of nice $\ell_\infty$-geometry is the key advantage of Adam over SGD. More specifically, we give a new convergence analysis for Adam under novel assumptions that loss is smooth under $\ell_\infty$-geometry rather than the more common $\ell_2$-geometry, which yields a much better empirical smoothness constant for GPT-2 and ResNet models. Our experiments confirm that Adam performs much worse when the favorable $\ell_\infty$-geometry is changed while SGD provably remains unaffected. We also extend the convergence analysis to blockwise Adam under novel blockwise smoothness assumptions.
format	Preprint
id	arxiv_https___arxiv_org_abs_2410_08198
institution	arXiv
publishDate	2024
record_format	arxiv
title	Adam Exploits $\ell_\infty$-geometry of Loss Landscape via Coordinate-wise Adaptivity
topic	Machine Learning
url	https://arxiv.org/abs/2410.08198

Similar Items