:: Library Catalog

Bibliographic Details
Main Authors:	Zheng, Chenyu; Wang, Rongzhen; Zhang, Xinyu; Li, Chongxuan
Format:	Preprint
Published:	2026
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2603.00541

Similar Items

Scaling Diffusion Transformers Efficiently via $μ$P
by: Zheng, Chenyu, et al.
Published: (2025)

A Theory for Conditional Generative Modeling on Multiple Data Sources
by: Wang, Rongzhen, et al.
Published: (2025)

On Mesa-Optimization in Autoregressively Trained Transformers: Emergence and Capability
by: Zheng, Chenyu, et al.
Published: (2024)

Mixture of Universal Experts: Scaling Virtual Width via Depth-Width Transformation
by: Chen, Yilong, et al.
Published: (2026)

Towards a Principled Muon under $μ\mathsf{P}$: Ensuring Spectral Conditions throughout Training
by: Zhao, John
Published: (2026)

Extending $μ$P: Spectral Conditions for Feature Learning Across Optimizers
by: Gupta, Akshita, et al.
Published: (2026)

The Role of Depth, Width, and Tree Size in Expressiveness of Deep Forest
by: Lyu, Shen-Huan, et al.
Published: (2024)

Global Convergence and Rich Feature Learning in $L$-Layer Infinite-Width Neural Networks under $μ$P Parametrization
by: Chen, Zixiang, et al.
Published: (2025)

Are Images Indistinguishable to Humans Also Indistinguishable to Classifiers?
by: You, Zebin, et al.
Published: (2024)

The Blessing of Randomness: SDE Beats ODE in General Diffusion-based Image Editing
by: Nie, Shen, et al.
Published: (2023)

Theory of Scaling Laws for In-Context Regression: Depth, Width, Context and Time
by: Bordelon, Blake, et al.
Published: (2025)

LLaDA 1.5: Variance-Reduced Preference Optimization for Large Language Diffusion Models
by: Zhu, Fengqi, et al.
Published: (2025)

Depth Separation in Norm-Bounded Infinite-Width Neural Networks
by: Parkinson, Suzanna, et al.
Published: (2024)

u-$μ$P: The Unit-Scaled Maximal Update Parametrization
by: Blake, Charlie, et al.
Published: (2024)

On the Surprising Effectiveness of Large Learning Rates under Standard Width Scaling
by: Haas, Moritz, et al.
Published: (2025)

AdaPerceiver: Transformers with Adaptive Width, Depth, and Tokens
by: Jajal, Purvish, et al.
Published: (2025)

Your Absorbing Discrete Diffusion Secretly Models the Conditional Distributions of Clean Data
by: Ou, Jingyang, et al.
Published: (2024)

Completed Hyperparameter Transfer across Modules, Width, Depth, Batch and Duration
by: Mlodozeniec, Bruno, et al.
Published: (2025)

Arithmetic-Mean $μ$P for Modern Architectures: A Unified Learning-Rate Scale for CNNs and ResNets
by: Zhang, Haosong, et al.
Published: (2025)

Depth-Width tradeoffs in Algorithmic Reasoning of Graph Tasks with Transformers
by: Yehudai, Gilad, et al.
Published: (2025)

μP$^2$: Effective Sharpness Aware Minimization Requires Layerwise Perturbation Scaling
by: Haas, Moritz, et al.
Published: (2024)

A Proof of Learning Rate Transfer under $μ$P
by: Hayou, Soufiane
Published: (2025)

Virtual Width Networks
by: Seed, et al.
Published: (2025)

On the Infinite Width and Depth Limits of Predictive Coding Networks
by: Innocenti, Francesco, et al.
Published: (2026)

EquiPocket: an E(3)-Equivariant Geometric Graph Neural Network for Ligand Binding Site Prediction
by: Zhang, Yang, et al.
Published: (2023)

Polynomial Width is Sufficient for Set Representation with High-dimensional Features
by: Wang, Peihao, et al.
Published: (2023)

Robust Monocular Depth Estimation under Challenging Conditions
by: Gasperini, Stefano, et al.
Published: (2023)

The Spectral Dimension of NTKs is Constant: A Theory of Implicit Regularization, Finite-Width Stability, and Scalable Estimation
by: Shukla, Praveen Anilkumar
Published: (2025)

An Empirical Study of $μ$P Learning Rate Transfer
by: Lingle, Lucas
Published: (2024)

SPARKLING: Balancing Signal Preservation and Symmetry Breaking for Width-Progressive Learning
by: Yu, Qifan, et al.
Published: (2026)

Geometric and Dynamic Scaling in Deep Transformers
by: Su, Haoran, et al.
Published: (2026)

Grouped Discrete Representation for Object-Centric Learning
by: Zhao, Rongzhen, et al.
Published: (2024)

Scaling up Masked Diffusion Models on Text
by: Nie, Shen, et al.
Published: (2024)

Depth, Not Data: An Analysis of Hessian Spectral Bifurcation
by: Deng, Shenyang, et al.
Published: (2026)

The Median is Easier than it Looks: Approximation with a Constant-Depth, Linear-Width ReLU Network
by: Dutta, Abhigyan, et al.
Published: (2026)

FAROS: Robust Federated Learning with Adaptive Scaling against Backdoor Attacks
by: Hu, Chenyu, et al.
Published: (2026)

BayesDiff: Estimating Pixel-wise Uncertainty in Diffusion via Bayesian Inference
by: Kou, Siqi, et al.
Published: (2023)

ReFusion: A Diffusion Large Language Model with Parallel Autoregressive Decoding
by: Li, Jia-Nan, et al.
Published: (2025)

Spectral Algorithms in Misspecified Regression: Convergence under Covariate Shift
by: Liu, Ren-Rui, et al.
Published: (2025)

R$^2$Energy: A Large-Scale Benchmark for Robust Renewable Energy Forecasting under Diverse and Extreme Conditions
by: Sheng, Zhi, et al.
Published: (2026)