Staff View: SimpleGPT: Improving GPT via A Simple Normalization Strategy :: Library Catalog

Bibliographic Details
Main Authors:	Chen, Marco; Qi, Xianbiao; He, Yelin; Ye, Jiaquan; Xiao, Rong
Format:	Preprint
Published:	2026
Subjects:	Machine Learning; Computation and Language; Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2602.01212

_version_	1866917239632953344
author	Chen, Marco; Qi, Xianbiao; He, Yelin; Ye, Jiaquan; Xiao, Rong
author_facet	Chen, Marco; Qi, Xianbiao; He, Yelin; Ye, Jiaquan; Xiao, Rong
contents	In this work, we revisit Transformer optimization through the lens of second-order geometry and establish a direct connection between architectural design, activation scale, the Hessian matrix, and the maximum tolerable learning rate. We introduce a simple normalization strategy, termed SimpleNorm, which stabilizes intermediate activation scales by construction. Then, by analyzing the Hessian of the loss with respect to network activations, we theoretically show that SimpleNorm significantly reduces the spectral norm of the Hessian, thereby permitting larger stable learning rates. We validate our theoretical findings through extensive experiments on large GPT models at parameter scales 1B, 1.4B, 7B and 8B. Empirically, SimpleGPT, our SimpleNorm-based network, tolerates learning rates 3$\times$-10$\times$ larger than standard convention, consistently demonstrates strong optimization stability, and achieves substantially better performance than well-established baselines. Specifically, when training 7B-scale models for 60K steps, SimpleGPT achieves a training loss that is 0.08 lower than that of LLaMA2 with QKNorm, reducing the loss from 2.290 to 2.208. Our source code will be released at https://github.com/Ocram7/SimpleGPT.
format	Preprint
id	arxiv_https___arxiv_org_abs_2602_01212
institution	arXiv
publishDate	2026
record_format	arxiv
title	SimpleGPT: Improving GPT via A Simple Normalization Strategy
topic	Machine Learning; Computation and Language; Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2602.01212

Similar Items