| Main Authors: | Yang, Hui; Ren, Tao; Jiang, Jinyang; Tian, Wan; Peng, Yijie |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | Machine Learning |
| Online Access: | https://arxiv.org/abs/2603.05960 |
| _version_ | 1866908874897883136 |
|---|---|
| author | Yang, Hui; Ren, Tao; Jiang, Jinyang; Tian, Wan; Peng, Yijie |
| author_facet | Yang, Hui; Ren, Tao; Jiang, Jinyang; Tian, Wan; Peng, Yijie |
| contents | Memory-efficient optimization methods have recently gained increasing attention for scaling full-parameter training of large language models under the GPU-memory bottleneck. Existing approaches either lack clear convergence guarantees or achieve only the standard ${\mathcal{O}}(ε^{-4})$ iteration complexity in nonconvex settings. We propose Omni-Masked Gradient Descent (OMGD), an optimization method based on mask traversal for memory-efficient training, and provide a nonconvex convergence analysis that establishes a strictly improved iteration complexity of $\tilde{\mathcal{O}}(ε^{-3})$ for finding an $ε$-approximate stationary point. Empirically, OMGD is a lightweight, plug-and-play approach that integrates seamlessly into most mainstream optimizers, yielding consistent improvements over competitive baselines in both fine-tuning and pre-training tasks. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2603_05960 |
| institution | arXiv |
| publishDate | 2026 |
| record_format | arxiv |
| title | Omni-Masked Gradient Descent: Memory-Efficient Optimization via Mask Traversal with Improved Convergence |
| topic | Machine Learning |
| url | https://arxiv.org/abs/2603.05960 |