Staff View: Convex Dominance in Deep Learning I: A Scaling Law of Loss and Learning Rate :: Library Catalog

Bibliographic Details
Main Authors:	Bu, Zhiqi; Xu, Shiyun; Mao, Jialin
Format:	Preprint
Published:	2026
Subjects:	Machine Learning; Computation and Language; Optimization and Control
Online Access:	https://arxiv.org/abs/2602.07145

_version_	1866915781192712192
author	Bu, Zhiqi; Xu, Shiyun; Mao, Jialin
author_facet	Bu, Zhiqi; Xu, Shiyun; Mao, Jialin
contents	Deep learning has a non-convex loss landscape, and its optimization dynamics are hard to analyze or control. Nevertheless, the dynamics can be empirically convex-like across various tasks, models, optimizers, hyperparameters, etc. In this work, we examine the applicability of convexity and Lipschitz continuity in deep learning, in order to precisely control the loss dynamics via learning rate schedules. We illustrate that deep learning quickly becomes weakly convex after a short period of training, and the loss is predictable by an upper bound on the last iterate, which further informs the scaling of the optimal learning rate. Through the lens of convexity, we build scaling laws of learning rates and losses that extrapolate as much as 80X across training horizons and 70X across model sizes.
format	Preprint
id	arxiv_https___arxiv_org_abs_2602_07145
institution	arXiv
publishDate	2026
record_format	arxiv
title	Convex Dominance in Deep Learning I: A Scaling Law of Loss and Learning Rate
topic	Machine Learning; Computation and Language; Optimization and Control
url	https://arxiv.org/abs/2602.07145