| Main Authors: | Zhang, Thomas T.; Moniri, Behrad; Nagwekar, Ansh; Rahman, Faraz; Xue, Anton; Hassani, Hamed; Matni, Nikolai |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | Machine Learning; Optimization and Control |
| Online Access: | https://arxiv.org/abs/2502.01763 |
| _version_ | 1866912217667993600 |
|---|---|
| author | Zhang, Thomas T.; Moniri, Behrad; Nagwekar, Ansh; Rahman, Faraz; Xue, Anton; Hassani, Hamed; Matni, Nikolai |
| author_facet | Zhang, Thomas T.; Moniri, Behrad; Nagwekar, Ansh; Rahman, Faraz; Xue, Anton; Hassani, Hamed; Matni, Nikolai |
| contents | Layer-wise preconditioning methods are a family of memory-efficient optimization algorithms that introduce preconditioners per axis of each layer's weight tensors. These methods have seen a recent resurgence, demonstrating impressive performance relative to entry-wise ("diagonal") preconditioning methods such as Adam(W) on a wide range of neural network optimization tasks. Complementary to their practical performance, we demonstrate that layer-wise preconditioning methods are provably necessary from a statistical perspective. To showcase this, we consider two prototypical models, linear representation learning and single-index learning, which are widely used to study how typical algorithms efficiently learn useful features to enable generalization. In these problems, we show SGD is a suboptimal feature learner when extending beyond ideal isotropic inputs $\mathbf{x} \sim \mathsf{N}(\mathbf{0}, \mathbf{I})$ and well-conditioned settings typically assumed in prior work. We demonstrate theoretically and numerically that this suboptimality is fundamental, and that layer-wise preconditioning emerges naturally as the solution. We further show that standard tools like Adam preconditioning and batch-norm only mildly mitigate these issues, supporting the unique benefits of layer-wise preconditioning. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2502_01763 |
| institution | arXiv |
| publishDate | 2025 |
| record_format | arxiv |
| title | On The Concurrence of Layer-wise Preconditioning Methods and Provable Feature Learning |
| topic | Machine Learning; Optimization and Control |
| url | https://arxiv.org/abs/2502.01763 |