| Main Author: | Nguyen, Quan |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | Machine Learning; Optimization and Control |
| Online Access: | https://arxiv.org/abs/2510.03478 |
| _version_ | 1866915693825359872 |
|---|---|
| author | Nguyen, Quan |
| author_facet | Nguyen, Quan |
| contents | While Adam is one of the most effective optimizers for training large-scale machine learning models, a theoretical understanding of how to optimally set its momentum factors, $β_1$ and $β_2$, remains largely incomplete. Prior works have shown that Adam can be seen as an instance of Follow-the-Regularized-Leader (FTRL), one of the most important classes of algorithms in online learning. The prior analyses in these works required setting $β_1 = \sqrt{β_2}$, which does not cover the more practical cases with $β_1 \neq \sqrt{β_2}$. We derive novel, more general analyses that hold for both $β_1 \geq \sqrt{β_2}$ and $β_1 \leq \sqrt{β_2}$. In both cases, our results strictly generalize the existing bounds. Furthermore, we show that our bounds are tight in the worst case. We also prove that setting $β_1 = \sqrt{β_2}$ is optimal for an oblivious adversary, but sub-optimal for a non-oblivious adversary. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2510_03478 |
| institution | arXiv |
| publishDate | 2025 |
| record_format | arxiv |
| title | How to Set $β_1, β_2$ in Adam: An Online Learning Perspective |
| topic | Machine Learning; Optimization and Control |
| url | https://arxiv.org/abs/2510.03478 |
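
The relation $β_1 = \sqrt{β_2}$ discussed in the abstract can be made concrete with a small sketch. The record does not reproduce the paper's own formulation, so the code below assumes the textbook Adam update with bias correction; the helper names `coupled_beta1`, `regime`, and `adam_step` are hypothetical and chosen only for illustration. It shows that the widely used defaults $β_1 = 0.9$, $β_2 = 0.999$ fall in the $β_1 \leq \sqrt{β_2}$ case, since $\sqrt{0.999} \approx 0.9995$.

```python
import math


def coupled_beta1(beta2: float) -> float:
    """Return beta1 = sqrt(beta2), the coupling required by the prior analyses."""
    if not 0.0 <= beta2 < 1.0:
        raise ValueError("beta2 must lie in [0, 1)")
    return math.sqrt(beta2)


def regime(beta1: float, beta2: float, tol: float = 1e-12) -> str:
    """Classify (beta1, beta2) relative to the curve beta1 = sqrt(beta2)."""
    root = math.sqrt(beta2)
    if abs(beta1 - root) <= tol:
        return "beta1 == sqrt(beta2)"
    return "beta1 > sqrt(beta2)" if beta1 > root else "beta1 < sqrt(beta2)"


def adam_step(x, m, v, grad, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One scalar step of textbook Adam (assumed form); t counts from 1."""
    m = beta1 * m + (1.0 - beta1) * grad          # first-moment average
    v = beta2 * v + (1.0 - beta2) * grad * grad   # second-moment average
    m_hat = m / (1.0 - beta1 ** t)                # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    x = x - lr * m_hat / (math.sqrt(v_hat) + eps)
    return x, m, v


if __name__ == "__main__":
    print(regime(0.9, 0.999))                       # common default setting
    print(regime(coupled_beta1(0.999), 0.999))      # coupled setting
    print(adam_step(1.0, 0.0, 0.0, grad=2.0, t=1))  # a single update
```

Passing `beta1=coupled_beta1(beta2)` to `adam_step` gives the coupled setting; any other pair lands in one of the two inequality cases that the abstract says the new analyses cover.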