Staff View: On the Overlooked Pitfalls of Weight Decay and How to Mitigate Them: A Gradient-Norm Perspective :: Library Catalog

Bibliographic Details
Main Authors:	Xie, Zeke; Xu, Zhiqiang; Zhang, Jingzhao; Sato, Issei; Sugiyama, Masashi
Format:	Preprint
Published:	2020
Subjects:	Machine Learning; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2011.11152

_version_	1866909289133637632
author	Xie, Zeke; Xu, Zhiqiang; Zhang, Jingzhao; Sato, Issei; Sugiyama, Masashi
author_facet	Xie, Zeke; Xu, Zhiqiang; Zhang, Jingzhao; Sato, Issei; Sugiyama, Masashi
contents	Weight decay is a simple yet powerful regularization technique that is widely used in training deep neural networks (DNNs). While weight decay has attracted much attention, previous studies have overlooked pitfalls related to the large gradient norms that result from weight decay. In this paper, we find that weight decay can lead to large gradient norms in the final phase of training (or at the terminated solution), which often indicates bad convergence and poor generalization. To mitigate these gradient-norm-centered pitfalls, we present the first practical scheduler for weight decay, the Scheduled Weight Decay (SWD) method, which dynamically adjusts the weight decay strength according to the gradient norm and significantly penalizes large gradient norms during training. Our experiments also show that SWD indeed mitigates large gradient norms and often significantly outperforms the conventional constant weight decay strategy for Adaptive Moment Estimation (Adam).
format	Preprint
id	arxiv_https___arxiv_org_abs_2011_11152
institution	arXiv
publishDate	2020
record_format	arxiv
title	On the Overlooked Pitfalls of Weight Decay and How to Mitigate Them: A Gradient-Norm Perspective
topic	Machine Learning; Artificial Intelligence
url	https://arxiv.org/abs/2011.11152
