| Main Authors: | Xie, Shuo; Mohamadi, Mohamad Amin; Li, Zhiyuan |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | Machine Learning |
| Online Access: | https://arxiv.org/abs/2410.08198 |
| _version_ | 1866912423252852736 |
|---|---|
| author | Xie, Shuo; Mohamadi, Mohamad Amin; Li, Zhiyuan |
| author_facet | Xie, Shuo; Mohamadi, Mohamad Amin; Li, Zhiyuan |
| contents | Adam outperforms SGD when training language models. Yet this advantage is not well understood theoretically -- previous convergence analyses for Adam and SGD mainly focus on the number of steps $T$ and are already minimax-optimal in non-convex cases, with both achieving a rate of $\widetilde{O}(T^{-1/4})$. In this work, we argue that the exploitation of nice $\ell_\infty$-geometry is the key advantage of Adam over SGD. More specifically, we give a new convergence analysis for Adam under the novel assumption that the loss is smooth under $\ell_\infty$-geometry rather than the more common $\ell_2$-geometry, which yields a much better empirical smoothness constant for GPT-2 and ResNet models. Our experiments confirm that Adam performs much worse when the favorable $\ell_\infty$-geometry is changed, while SGD provably remains unaffected. We also extend the convergence analysis to blockwise Adam under novel blockwise smoothness assumptions. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2410_08198 |
| institution | arXiv |
| publishDate | 2024 |
| record_format | arxiv |
| title | Adam Exploits $\ell_\infty$-geometry of Loss Landscape via Coordinate-wise Adaptivity |
| topic | Machine Learning |
| url | https://arxiv.org/abs/2410.08198 |
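
The abstract's claim that SGD is unaffected by a change of geometry while Adam is not can be illustrated on a toy problem. The following is a minimal sketch, not the authors' experiment: it assumes a hypothetical ill-conditioned diagonal quadratic, uses full-batch gradient descent as a deterministic stand-in for SGD, and applies a random orthogonal rotation to the coordinates. All names (`run_gd`, `run_adam`, `H_rot`), the dimension, and the step sizes are illustrative choices.

```python
import numpy as np

def quadratic_loss(H, x):
    # f(x) = 0.5 * x^T H x
    return 0.5 * x @ H @ x

def run_gd(H, x0, lr, steps):
    # Plain gradient descent; records the loss before each update.
    x = x0.copy()
    losses = []
    for _ in range(steps):
        losses.append(quadratic_loss(H, x))
        x = x - lr * (H @ x)
    return np.array(losses)

def run_adam(H, x0, lr, steps, beta1=0.9, beta2=0.999, eps=1e-8):
    # Standard Adam with bias correction; the update is coordinate-wise.
    x = x0.copy()
    m = np.zeros_like(x)
    v = np.zeros_like(x)
    losses = []
    for t in range(1, steps + 1):
        losses.append(quadratic_loss(H, x))
        g = H @ x
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
        m_hat = m / (1 - beta1**t)
        v_hat = v / (1 - beta2**t)
        x = x - lr * m_hat / (np.sqrt(v_hat) + eps)
    return np.array(losses)

rng = np.random.default_rng(0)
d = 20
H = np.diag(np.logspace(0, 3, d))                 # axis-aligned curvature (assumed toy problem)
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))  # random orthogonal matrix
H_rot = Q @ H @ Q.T                               # same function in rotated coordinates
x0 = rng.standard_normal(d)
x0_rot = Q @ x0                                   # same starting point, rotated

gd_plain = run_gd(H, x0, lr=5e-4, steps=200)
gd_rot = run_gd(H_rot, x0_rot, lr=5e-4, steps=200)
adam_plain = run_adam(H, x0, lr=1e-2, steps=200)
adam_rot = run_adam(H_rot, x0_rot, lr=1e-2, steps=200)

print("GD trajectories match under rotation:  ", np.allclose(gd_plain, gd_rot, rtol=1e-6))
print("Adam trajectories match under rotation:", np.allclose(adam_plain, adam_rot, rtol=1e-6))
```

Comparing `adam_plain[-1]` with `adam_rot[-1]` shows how much the rotation changes Adam's progress on this toy problem, while `gd_plain` and `gd_rot` agree up to floating-point error because gradient descent commutes with orthogonal changes of coordinates.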