Gespeichert in:
Bibliographische Detailangaben
1. Verfasser: Medeiros, Daniel Nobrega
Format: Preprint
Veröffentlicht: 2026
Schlagworte:
Online-Zugang:https://arxiv.org/abs/2604.07405
Tags: Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
_version_ 1866915924196458496
author Medeiros, Daniel Nobrega
author_facet Medeiros, Daniel Nobrega
contents Why does gradient descent reliably find good solutions in non-convex neural network optimization, despite the landscape being NP-hard in the worst case? We show that gradient flow on L-layer ReLU networks without bias preserves L-1 conservation laws C_l = ||W_{l+1}||_F^2 - ||W_l||_F^2, confining trajectories to lower-dimensional manifolds. Under discrete gradient descent, these laws break with total drift scaling as eta^alpha where alpha is approximately 1.1-1.6 depending on architecture, loss function, and width. We decompose this drift exactly as eta^2 * S(eta), where the gradient imbalance sum S(eta) admits a closed-form spectral crossover formula with mode coefficients c_k proportional to e_k(0)^2 * lambda_{x,k}^2, derived from first principles and validated for both linear (R=0.85) and ReLU (R>0.80) networks. For cross-entropy loss, softmax probability concentration drives exponential Hessian spectral compression with timescale tau = Theta(1/eta) independent of training set size, explaining why cross-entropy self-regularizes the drift exponent near alpha=1.0. We identify two dynamical regimes separated by a width-dependent transition: a perturbative sub-Edge-of-Stability regime where the spectral formula applies, and a non-perturbative regime with extensive mode coupling. All predictions are validated across 23 experiments.
format Preprint
id arxiv_https___arxiv_org_abs_2604_07405
institution arXiv
publishDate 2026
record_format arxiv
spellingShingle Conservation Law Breaking at the Edge of Stability: A Spectral Theory of Non-Convex Neural Network Optimization
Medeiros, Daniel Nobrega
Machine Learning
Artificial Intelligence
68T07, 49M37
I.2.6
Why does gradient descent reliably find good solutions in non-convex neural network optimization, despite the landscape being NP-hard in the worst case? We show that gradient flow on L-layer ReLU networks without bias preserves L-1 conservation laws C_l = ||W_{l+1}||_F^2 - ||W_l||_F^2, confining trajectories to lower-dimensional manifolds. Under discrete gradient descent, these laws break with total drift scaling as eta^alpha where alpha is approximately 1.1-1.6 depending on architecture, loss function, and width. We decompose this drift exactly as eta^2 * S(eta), where the gradient imbalance sum S(eta) admits a closed-form spectral crossover formula with mode coefficients c_k proportional to e_k(0)^2 * lambda_{x,k}^2, derived from first principles and validated for both linear (R=0.85) and ReLU (R>0.80) networks. For cross-entropy loss, softmax probability concentration drives exponential Hessian spectral compression with timescale tau = Theta(1/eta) independent of training set size, explaining why cross-entropy self-regularizes the drift exponent near alpha=1.0. We identify two dynamical regimes separated by a width-dependent transition: a perturbative sub-Edge-of-Stability regime where the spectral formula applies, and a non-perturbative regime with extensive mode coupling. All predictions are validated across 23 experiments.
title Conservation Law Breaking at the Edge of Stability: A Spectral Theory of Non-Convex Neural Network Optimization
topic Machine Learning
Artificial Intelligence
68T07, 49M37
I.2.6
url https://arxiv.org/abs/2604.07405