| Main Authors: | Zhang, Hongtao, Zhou, Wenjie, Chen, Wei, Cheng, Xueqi |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2605.08933 |
Similar Items
The Stability of Singular Distribution: A Spectral Perspective on the Two-Phase Dynamics of Language Model Pre-training
by: Zhang, Hongtao, et al.
Published: (2026)
BSFA: Leveraging the Subspace Dichotomy to Accelerate Neural Network Training
by: Zhou, Wenjie, et al.
Published: (2025)
Extra-Merge: Tracing the Rank-1 Subspace of Model Merging in Language Model Pre-Training
by: Zhou, Wenjie, et al.
Published: (2026)
MONA: Muon Optimizer with Nesterov Acceleration for Scalable Language Model Training
by: Li, Jiacheng, et al.
Published: (2026)
AdaMuon: Adaptive Muon Optimizer
by: Si, Chongjie, et al.
Published: (2025)
Muon Optimizer Accelerates Grokking
by: Tveit, Amund, et al.
Published: (2025)
When Muon Optimizer Meets Adversarial Training: A Theoretical and Empirical Study
by: Yan, Jun, et al.
Published: (2026)
Phases of Muon: When Muon Eclipses SignSGD
by: Paquette, Elliot, et al.
Published: (2026)
Dynamic Adaptive Shared Experts with Grouped Multi-Head Attention Mixture of Experts
by: Li, Cheng, et al.
Published: (2025)
LiMuon: Light and Fast Muon Optimizer for Large Models
by: Huang, Feihu, et al.
Published: (2025)
MuonQ: Enhancing Low-Bit Muon Quantization via Directional Fidelity Optimization
by: Su, Yupeng, et al.
Published: (2026)
FedMuon: Accelerating Federated Learning with Matrix Orthogonalization
by: Liu, Junkang, et al.
Published: (2025)
MiMuon: Mixed Muon Optimizer with Improved Generalization for Large Models
by: Huang, Feihu, et al.
Published: (2026)
Multi-Head Low-Rank Attention
by: Liu, Songtao, et al.
Published: (2026)
SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration
by: Zhang, Jintao, et al.
Published: (2024)
TrasMuon: Trust-Region Adaptive Scaling for Orthogonalized Momentum Optimizers
by: Cheng, Peng, et al.
Published: (2026)
Why Softmax Attention Outperforms Linear Attention
by: Deng, Yichuan, et al.
Published: (2023)
Foundation Models in Radiology: What, How, When, Why and Why Not
by: Paschali, Magdalini, et al.
Published: (2024)
SignMuon: Communication-Efficient Distributed Muon Optimization
by: Mishra, Neel, et al.
Published: (2026)
Muon Optimizes Under Spectral Norm Constraints
by: Chen, Lizhang, et al.
Published: (2025)
Effective Quantization of Muon Optimizer States
by: Gupta, Aman, et al.
Published: (2025)
SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention
by: Csordás, Róbert, et al.
Published: (2023)
Optimised Grouped-Query Attention Mechanism for Transformers
by: Chen, Yuang, et al.
Published: (2024)
Benign Overfitting in Single-Head Attention
by: Magen, Roey, et al.
Published: (2024)
The Newton-Muon Optimizer
by: Du, Zhehang, et al.
Published: (2026)
Muon: Training and Trade-offs with Latent Attention and MoE
by: Mehta, Sushant, et al.
Published: (2025)
POME: Post Optimization Model Edit via Muon-style Projection
by: Liu, Yong, et al.
Published: (2025)
Muon with Spectral Guidance: Efficient Optimization for Scientific Machine Learning
by: Lu, Binghang, et al.
Published: (2026)
Interleaved Head Attention
by: Duvvuri, Sai Surya, et al.
Published: (2026)
NorMuon: Making Muon more efficient and scalable
by: Li, Zichong, et al.
Published: (2025)
Muown: Row-Norm Control for Muon Optimization
by: Lion, Kai, et al.
Published: (2026)
Multi-Head Attention as a Source of Catastrophic Forgetting in MoE Transformers
by: Chen, Anrui, et al.
Published: (2026)
Adaptive Head Budgeting for Efficient Multi-Head Attention
by: Faye, Bilal, et al.
Published: (2026)
When and Why Does Unsupervised RL Succeed in Mathematical Reasoning? A Manifold Envelopment Perspective
by: Zhang, Zelin, et al.
Published: (2026)
Amortized Variational Inference: When and Why?
by: Margossian, Charles C., et al.
Published: (2023)
When, Where and Why to Average Weights?
by: Ajroldi, Niccolò, et al.
Published: (2025)
Improved Convergence Rates of Muon Optimizer for Nonconvex Optimization
by: Nagashima, Shuntaro, et al.
Published: (2026)
On the Convergence Analysis of Muon
by: Shen, Wei, et al.
Published: (2025)
Efficient Conditioning Why Pseudo Observation Batch Bayesian Optimization Works When It Does not
by: Nagaswetha, Kumbha, et al.
Published: (2026)
Convergence Bound and Critical Batch Size of Muon Optimizer
by: Sato, Naoki, et al.
Published: (2025)