Saved in:
| Main Authors: | , , , |
|---|---|
| Format: | Preprint |
| Published: |
2026
|
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2603.00541 |
| Tags: |
Add Tag
No Tags, Be the first to tag this record!
|
Table of Contents:
- Generative foundation models are increasingly scaled in both width and depth, posing significant challenges for stable feature learning and reliable hyperparameter (HP) transfer across model sizes. While maximal update parameterization ($μ$P) has provided a principled solution to both problems for width scaling, existing extensions to the joint width-depth scaling regime remain fragmented, architecture- and optimizer-specific, and often rely on technically involved theories. In this work, we develop a simple and unified spectral framework for $μ$P under joint width-depth scaling. For deep residual networks whose residual blocks contain $k$ transformations, the framework specifies how the norms of weights and their per-step updates should scale with width and depth. It reveals a fundamental transition from $k=1$ to $k\geq 2$, unifying previously disparate $μ$P formulations and identifying the $k\geq 2$ case as more appropriate for practical architectures with multi-transformation branches such as Transformers. Building on this framework, we derive a general recipe for implementing $μ$P across a broad class of optimizers by mapping spectral constraints to concrete HP parameterizations, recovering existing results and extending them to additional optimizers. Finally, experiments on GPT-2 style language models show that the $μ$P formulation derived from the $k\geq 2$ case achieves stable feature learning and robust HP transfer under width-depth scaling, whereas standard parameterization and $μ$P in the $k=1$ case often fail to do so. These results support the practical effectiveness of the proposed spectral framework.