Bibliographic Details
Main Authors: Xie, Zeke; Xu, Zhiqiang; Zhang, Jingzhao; Sato, Issei; Sugiyama, Masashi
Format: Preprint
Published: 2020
Subjects: Machine Learning; Artificial Intelligence
Online Access: https://arxiv.org/abs/2011.11152
author Xie, Zeke
Xu, Zhiqiang
Zhang, Jingzhao
Sato, Issei
Sugiyama, Masashi
contents Weight decay is a simple yet powerful regularization technique that is widely used in training deep neural networks (DNNs). Although weight decay has attracted much attention, previous studies have overlooked pitfalls related to the large gradient norms that result from weight decay. In this paper, we find that weight decay can lead to large gradient norms in the final phase of training (or at the terminated solution), which often indicates bad convergence and poor generalization. To mitigate these gradient-norm-centered pitfalls, we present the first practical scheduler for weight decay, the Scheduled Weight Decay (SWD) method, which dynamically adjusts the weight decay strength according to the gradient norm and significantly penalizes large gradient norms during training. Our experiments support that SWD mitigates large gradient norms and often significantly outperforms the conventional constant weight decay strategy for Adaptive Moment Estimation (Adam).
format Preprint
id arxiv_https___arxiv_org_abs_2011_11152
institution arXiv
publishDate 2020
record_format arxiv
title On the Overlooked Pitfalls of Weight Decay and How to Mitigate Them: A Gradient-Norm Perspective
topic Machine Learning
Artificial Intelligence
url https://arxiv.org/abs/2011.11152
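
The abstract in this record describes a scheduler that adjusts the weight decay strength according to the gradient norm, but the record itself does not give the update rule. The sketch below is only an illustration of that general idea on top of a plain Adam step with decoupled weight decay. The function name `adam_scheduled_decay_step`, the parameter `base_decay`, and the specific rule (dividing the base decay by the root of the mean second-moment estimate, so larger gradients give weaker decay) are assumptions made here for illustration and should not be read as the authors' exact method.

```python
import math


def adam_scheduled_decay_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9,
                              beta2=0.999, base_decay=5e-4, eps=1e-8):
    """One Adam step with decoupled weight decay whose strength depends on
    a gradient-magnitude statistic (hypothetical rule, see the note above)."""
    # Standard Adam first- and second-moment updates with bias correction.
    m = [beta1 * mi + (1 - beta1) * g for mi, g in zip(m, grad)]
    v = [beta2 * vi + (1 - beta2) * g * g for vi, g in zip(v, grad)]
    m_hat = [mi / (1 - beta1 ** t) for mi in m]
    v_hat = [vi / (1 - beta2 ** t) for vi in v]

    # Assumed schedule: rescale the base decay by the inverse root of the
    # mean second-moment estimate across all parameters.
    v_bar = sum(v_hat) / len(v_hat)
    decay = base_decay / (math.sqrt(v_bar) + eps)

    # Adam update plus the decoupled, rescaled weight decay term.
    new_theta = [
        th - lr * mh / (math.sqrt(vh) + eps) - lr * decay * th
        for th, mh, vh in zip(theta, m_hat, v_hat)
    ]
    return new_theta, m, v, decay


# Usage: compare the effective decay for large and small gradients.
theta0, zeros = [1.0, -1.0], [0.0, 0.0]
_, _, _, decay_large = adam_scheduled_decay_step(theta0, [2.0, 2.0], zeros, zeros, t=1)
_, _, _, decay_small = adam_scheduled_decay_step(theta0, [0.5, 0.5], zeros, zeros, t=1)
```

Under this assumed rule, the effective decay is smaller when gradients are large and larger when they are small; a constant weight decay would use `base_decay` unchanged at every step.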