| Main Authors: | Chen, Lizhang, Li, Jonathan, Wang, Qi, Liao, Runlong, Li, Shuozhe, Liang, Chen, Lao, Ni, Liu, Qiang |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2605.15403 |
Similar Items
Muon Optimizes Under Spectral Norm Constraints
by: Chen, Lizhang, et al.
Published: (2025)
Cautious Weight Decay
by: Chen, Lizhang, et al.
Published: (2025)
Lion Secretly Solves Constrained Optimization: As Lyapunov Predicts
by: Chen, Lizhang, et al.
Published: (2023)
Communication Efficient Distributed Training with Distributed Lion
by: Liu, Bo, et al.
Published: (2024)
Guided by the Experts: Provable Feature Learning Dynamic of Soft-Routed Mixture-of-Experts
by: Liao, Fangshuo, et al.
Published: (2025)
Training-Free Looped Transformers
by: Chen, Lizhang, et al.
Published: (2026)
Imitation Learning from Observations: An Autoregressive Mixture of Experts Approach
by: Wang, Renzi, et al.
Published: (2024)
A Theoretical Framework for Auxiliary-Loss-Free Load Balancing of Sparse Mixture-of-Experts in Large-Scale AI Models
by: Han, X. Y., et al.
Published: (2025)
Hierarchical Mixture-of-Experts with Two-Stage Optimization
by: Molodtsov, Gleb, et al.
Published: (2026)
A Relaxed Wasserstein Distance Formulation for Mixtures of Radially Contoured Distributions
by: Chen, Keyu, et al.
Published: (2025)
Learning to Specialize: Joint Gating-Expert Training for Adaptive MoEs in Decentralized Settings
by: Farhat, Yehya, et al.
Published: (2023)
Diffusion Model for Data-Driven Black-Box Optimization
by: Li, Zihao, et al.
Published: (2024)
AutoBalance: An Automatic Balancing Framework for Training Physics-Informed Neural Networks
by: An, Kang, et al.
Published: (2025)
Gradient descent in matrix factorization: Understanding large initialization
by: Chen, Hengchao, et al.
Published: (2023)
An inexact Bregman proximal point method and its acceleration version for unbalanced optimal transport
by: Chen, Xiang, et al.
Published: (2024)
Muon in Associative Memory Learning: Training Dynamics and Scaling Laws
by: Li, Binghui, et al.
Published: (2026)
Nonconvex Optimization Framework for Group-Sparse Feedback Linear-Quadratic Optimal Control: Non-Penalty Approach
by: Feng, Lechen, et al.
Published: (2025)
Nonconvex Optimization Framework for Group-Sparse Feedback Linear-Quadratic Optimal Control: Penalty Approach
by: Feng, Lechen, et al.
Published: (2025)
Nonsmooth Nonconvex-Nonconcave Minimax Optimization: Primal-Dual Balancing and Iteration Complexity Analysis
by: Li, Jiajin, et al.
Published: (2022)
Feed m Birds with One Scone: Accelerating Multi-task Gradient Balancing via Bi-level Optimization
by: Chen, Xuxing, et al.
Published: (2026)
Asynchronous and Stochastic Distributed Resource Allocation
by: Li, Qiang, et al.
Published: (2025)
MultiBalance: Multi-Objective Gradient Balancing in Industrial-Scale Multi-Task Recommendation System
by: He, Yun, et al.
Published: (2024)
Solving Sparse & High-Dimensional-Output Regression via Compression
by: Li, Renyuan, et al.
Published: (2024)
In-memory Training on Analog Devices with Limited Conductance States via Multi-tile Residual Learning
by: Li, Jindan, et al.
Published: (2025)
Proximal Oracles for Optimization and Sampling
by: Liang, Jiaming, et al.
Published: (2024)
Neural Network Training Techniques Regularize Optimization Trajectory: An Empirical Study
by: Chen, Cheng, et al.
Published: (2020)
Seesaw: Accelerating Training by Balancing Learning Rate and Batch Size Scheduling
by: Meterez, Alexandru, et al.
Published: (2025)
Adaptive Batch Size Schedules for Distributed Training of Language Models with Data and Model Parallelism
by: Lau, Tim Tsz-Kit, et al.
Published: (2024)
Natural Geometry of Robust Data Attribution: From Convex Models to Deep Networks
by: Li, Shihao, et al.
Published: (2025)
Policy Mirror Descent with Temporal Difference Learning: Sample Complexity under Online Markov Data
by: Li, Wenye, et al.
Published: (2025)
Inertial Quadratic Majorization Minimization with Application to Kernel Regularized Learning
by: Heng, Qiang, et al.
Published: (2025)
Homotopy Relaxation Training Algorithms for Infinite-Width Two-Layer ReLU Neural Networks
by: Yang, Yahong, et al.
Published: (2023)
Convergence of Implicit Gradient Descent for Training Two-Layer Physics-Informed Neural Networks
by: Xu, Xianliang, et al.
Published: (2024)
ADMM Algorithms for Residual Network Training: Convergence Analysis and Parallel Implementation
by: Xu, Jintao, et al.
Published: (2023)
GNMR: Runtime Stability Control for Low-Precision Large Language Model Training
by: Kong, Boao, et al.
Published: (2026)
Can Temporal-Difference and Q-Learning Learn Representation? A Mean-Field Theory
by: Zhang, Yufeng, et al.
Published: (2020)
Quantum Learning and Estimation for Coordinated Operation between Distribution Networks and Energy Communities
by: Zhuang, Yingrui, et al.
Published: (2025)
Adaptive Federated Minimax Optimization with Lower Complexities
by: Huang, Feihu, et al.
Published: (2022)
A Simple Mixture Policy Parameterization for Improving Sample Efficiency of CVaR Optimization
by: Luo, Yudong, et al.
Published: (2024)
Principled Bayesian Optimisation in Collaboration with Human Experts
by: Xu, Wenjie, et al.
Published: (2024)