| Main Authors: | Arnaboldi, Luca; Dandi, Yatin; Krzakala, Florent; Loureiro, Bruno; Pesce, Luca; Stephan, Ludovic |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | Machine Learning |
| Online Access: | https://arxiv.org/abs/2406.02157 |
| _version_ | 1866912015062138880 |
|---|---|
| author | Arnaboldi, Luca; Dandi, Yatin; Krzakala, Florent; Loureiro, Bruno; Pesce, Luca; Stephan, Ludovic |
| author_facet | Arnaboldi, Luca; Dandi, Yatin; Krzakala, Florent; Loureiro, Bruno; Pesce, Luca; Stephan, Ludovic |
| contents | We study the impact of the batch size $n_b$ on the iteration time $T$ of training two-layer neural networks with one-pass stochastic gradient descent (SGD) on multi-index target functions of isotropic covariates. We characterize the optimal batch size minimizing the iteration time as a function of the hardness of the target, as measured by its information exponents. We show that performing gradient updates with large batches $n_b \lesssim d^{\frac{\ell}{2}}$ minimizes the training time without changing the total sample complexity, where $\ell$ is the information exponent of the target to be learned \citep{arous2021online} and $d$ is the input dimension. However, increasing the batch size beyond this range, $n_b \gg d^{\frac{\ell}{2}}$, is detrimental to improving the time complexity of SGD. We provably overcome this fundamental limitation via a different training protocol, *Correlation loss SGD*, which suppresses the auto-correlation terms in the loss function. We show that the training progress can be tracked by a system of low-dimensional ordinary differential equations (ODEs). Finally, we validate our theoretical results with numerical experiments. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2406_02157 |
| institution | arXiv |
| publishDate | 2024 |
| record_format | arxiv |
| title | Online Learning and Information Exponents: On The Importance of Batch size, and Time/Complexity Tradeoffs |
| topic | Machine Learning |
| url | https://arxiv.org/abs/2406.02157 |
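
The abstract's setting, one-pass SGD with a batch size $n_b$ and a correlation loss, can be illustrated with a minimal sketch. It is not the authors' implementation. Everything concrete here is an assumption made for illustration: a single student neuron rather than a two-layer network, a `tanh` activation for both teacher and student (an easy target, with information exponent 1), a hypothetical target direction `w_star`, projection of the weights back onto the unit sphere after each step, and the default values of `d`, `n_b`, `steps`, and `lr`. The correlation loss is taken as $-y f(x)$, that is, the squared loss with the $f(x)^2$ auto-correlation term dropped.

```python
import numpy as np


def correlation_loss_sgd(d=50, n_b=512, steps=200, lr=0.5, seed=0):
    """One-pass SGD on the correlation loss -y * f(x) for a single neuron.

    All defaults are illustrative assumptions, not values from the paper.
    Returns the learned weights, their overlap with the target direction,
    and the total number of samples consumed (steps * n_b).
    """
    rng = np.random.default_rng(seed)

    # Hypothetical single-index target direction (first basis vector).
    w_star = np.zeros(d)
    w_star[0] = 1.0

    # Random initialization on the unit sphere.
    w = rng.standard_normal(d)
    w /= np.linalg.norm(w)

    for _ in range(steps):
        # One-pass: every step draws a fresh batch of isotropic Gaussian inputs.
        X = rng.standard_normal((n_b, d))
        y = np.tanh(X @ w_star)

        # Gradient of -mean(y * tanh(X @ w)) with respect to w.
        pre = X @ w
        grad = -(y * (1.0 - np.tanh(pre) ** 2)) @ X / n_b

        # Gradient step, then projection back onto the unit sphere (assumption).
        w = w - lr * grad
        w /= np.linalg.norm(w)

    return w, float(w @ w_star), steps * n_b


if __name__ == "__main__":
    w, overlap, n_samples = correlation_loss_sgd()
    print(f"overlap with target: {overlap:.3f}, samples used: {n_samples}")
```

In this sketch the number of iterations `steps` plays the role of $T$ and the total sample count is `steps * n_b`, so at a fixed sample budget a larger batch means fewer iterations. The sketch does not reproduce the paper's thresholds or its ODE description.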