| Main Authors: | Zheng, Chenyu, Wang, Rongzhen, Zhang, Xinyu, Li, Chongxuan |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2603.00541 |
Similar Items
Scaling Diffusion Transformers Efficiently via $μ$P
by: Zheng, Chenyu, et al.
Published: (2025)
A Theory for Conditional Generative Modeling on Multiple Data Sources
by: Wang, Rongzhen, et al.
Published: (2025)
On Mesa-Optimization in Autoregressively Trained Transformers: Emergence and Capability
by: Zheng, Chenyu, et al.
Published: (2024)
Mixture of Universal Experts: Scaling Virtual Width via Depth-Width Transformation
by: Chen, Yilong, et al.
Published: (2026)
Towards a Principled Muon under $μ\mathsf{P}$: Ensuring Spectral Conditions throughout Training
by: Zhao, John
Published: (2026)
Extending $μ$P: Spectral Conditions for Feature Learning Across Optimizers
by: Gupta, Akshita, et al.
Published: (2026)
The Role of Depth, Width, and Tree Size in Expressiveness of Deep Forest
by: Lyu, Shen-Huan, et al.
Published: (2024)
Global Convergence and Rich Feature Learning in $L$-Layer Infinite-Width Neural Networks under $μ$P Parametrization
by: Chen, Zixiang, et al.
Published: (2025)
Are Images Indistinguishable to Humans Also Indistinguishable to Classifiers?
by: You, Zebin, et al.
Published: (2024)
The Blessing of Randomness: SDE Beats ODE in General Diffusion-based Image Editing
by: Nie, Shen, et al.
Published: (2023)
Theory of Scaling Laws for In-Context Regression: Depth, Width, Context and Time
by: Bordelon, Blake, et al.
Published: (2025)
LLaDA 1.5: Variance-Reduced Preference Optimization for Large Language Diffusion Models
by: Zhu, Fengqi, et al.
Published: (2025)
Depth Separation in Norm-Bounded Infinite-Width Neural Networks
by: Parkinson, Suzanna, et al.
Published: (2024)
u-$μ$P: The Unit-Scaled Maximal Update Parametrization
by: Blake, Charlie, et al.
Published: (2024)
On the Surprising Effectiveness of Large Learning Rates under Standard Width Scaling
by: Haas, Moritz, et al.
Published: (2025)
AdaPerceiver: Transformers with Adaptive Width, Depth, and Tokens
by: Jajal, Purvish, et al.
Published: (2025)
Your Absorbing Discrete Diffusion Secretly Models the Conditional Distributions of Clean Data
by: Ou, Jingyang, et al.
Published: (2024)
Completed Hyperparameter Transfer across Modules, Width, Depth, Batch and Duration
by: Mlodozeniec, Bruno, et al.
Published: (2025)
Arithmetic-Mean $μ$P for Modern Architectures: A Unified Learning-Rate Scale for CNNs and ResNets
by: Zhang, Haosong, et al.
Published: (2025)
Depth-Width tradeoffs in Algorithmic Reasoning of Graph Tasks with Transformers
by: Yehudai, Gilad, et al.
Published: (2025)
μP$^2$: Effective Sharpness Aware Minimization Requires Layerwise Perturbation Scaling
by: Haas, Moritz, et al.
Published: (2024)
A Proof of Learning Rate Transfer under $μ$P
by: Hayou, Soufiane
Published: (2025)
Virtual Width Networks
by: Seed, et al.
Published: (2025)
On the Infinite Width and Depth Limits of Predictive Coding Networks
by: Innocenti, Francesco, et al.
Published: (2026)
EquiPocket: an E(3)-Equivariant Geometric Graph Neural Network for Ligand Binding Site Prediction
by: Zhang, Yang, et al.
Published: (2023)
Polynomial Width is Sufficient for Set Representation with High-dimensional Features
by: Wang, Peihao, et al.
Published: (2023)
Robust Monocular Depth Estimation under Challenging Conditions
by: Gasperini, Stefano, et al.
Published: (2023)
The Spectral Dimension of NTKs is Constant: A Theory of Implicit Regularization, Finite-Width Stability, and Scalable Estimation
by: Shukla, Praveen Anilkumar
Published: (2025)
An Empirical Study of $μ$P Learning Rate Transfer
by: Lingle, Lucas
Published: (2024)
SPARKLING: Balancing Signal Preservation and Symmetry Breaking for Width-Progressive Learning
by: Yu, Qifan, et al.
Published: (2026)
Geometric and Dynamic Scaling in Deep Transformers
by: Su, Haoran, et al.
Published: (2026)
Grouped Discrete Representation for Object-Centric Learning
by: Zhao, Rongzhen, et al.
Published: (2024)
Scaling up Masked Diffusion Models on Text
by: Nie, Shen, et al.
Published: (2024)
Depth, Not Data: An Analysis of Hessian Spectral Bifurcation
by: Deng, Shenyang, et al.
Published: (2026)
The Median is Easier than it Looks: Approximation with a Constant-Depth, Linear-Width ReLU Network
by: Dutta, Abhigyan, et al.
Published: (2026)
FAROS: Robust Federated Learning with Adaptive Scaling against Backdoor Attacks
by: Hu, Chenyu, et al.
Published: (2026)
BayesDiff: Estimating Pixel-wise Uncertainty in Diffusion via Bayesian Inference
by: Kou, Siqi, et al.
Published: (2023)
ReFusion: A Diffusion Large Language Model with Parallel Autoregressive Decoding
by: Li, Jia-Nan, et al.
Published: (2025)
Spectral Algorithms in Misspecified Regression: Convergence under Covariate Shift
by: Liu, Ren-Rui, et al.
Published: (2025)
R$^2$Energy: A Large-Scale Benchmark for Robust Renewable Energy Forecasting under Diverse and Extreme Conditions
by: Sheng, Zhi, et al.
Published: (2026)