Library Catalog

Bibliographic Details
Main Authors:	Zhang, Hongtao; Zhou, Wenjie; Chen, Wei; Cheng, Xueqi
Format:	Preprint
Published:	2026
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2605.08933

Similar Items

The Stability of Singular Distribution: A Spectral Perspective on the Two-Phase Dynamics of Language Model Pre-training
by: Zhang, Hongtao, et al.
Published: (2026)

BSFA: Leveraging the Subspace Dichotomy to Accelerate Neural Network Training
by: Zhou, Wenjie, et al.
Published: (2025)

Extra-Merge: Tracing the Rank-1 Subspace of Model Merging in Language Model Pre-Training
by: Zhou, Wenjie, et al.
Published: (2026)

MONA: Muon Optimizer with Nesterov Acceleration for Scalable Language Model Training
by: Li, Jiacheng, et al.
Published: (2026)

AdaMuon: Adaptive Muon Optimizer
by: Si, Chongjie, et al.
Published: (2025)

Muon Optimizer Accelerates Grokking
by: Tveit, Amund, et al.
Published: (2025)

When Muon Optimizer Meets Adversarial Training: A Theoretical and Empirical Study
by: Yan, Jun, et al.
Published: (2026)

Phases of Muon: When Muon Eclipses SignSGD
by: Paquette, Elliot, et al.
Published: (2026)

Dynamic Adaptive Shared Experts with Grouped Multi-Head Attention Mixture of Experts
by: Li, Cheng, et al.
Published: (2025)

LiMuon: Light and Fast Muon Optimizer for Large Models
by: Huang, Feihu, et al.
Published: (2025)

MuonQ: Enhancing Low-Bit Muon Quantization via Directional Fidelity Optimization
by: Su, Yupeng, et al.
Published: (2026)

FedMuon: Accelerating Federated Learning with Matrix Orthogonalization
by: Liu, Junkang, et al.
Published: (2025)

MiMuon: Mixed Muon Optimizer with Improved Generalization for Large Models
by: Huang, Feihu, et al.
Published: (2026)

Multi-Head Low-Rank Attention
by: Liu, Songtao, et al.
Published: (2026)

SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration
by: Zhang, Jintao, et al.
Published: (2024)

TrasMuon: Trust-Region Adaptive Scaling for Orthogonalized Momentum Optimizers
by: Cheng, Peng, et al.
Published: (2026)

Why Softmax Attention Outperforms Linear Attention
by: Deng, Yichuan, et al.
Published: (2023)

Foundation Models in Radiology: What, How, When, Why and Why Not
by: Paschali, Magdalini, et al.
Published: (2024)

SignMuon: Communication-Efficient Distributed Muon Optimization
by: Mishra, Neel, et al.
Published: (2026)

Muon Optimizes Under Spectral Norm Constraints
by: Chen, Lizhang, et al.
Published: (2025)

Effective Quantization of Muon Optimizer States
by: Gupta, Aman, et al.
Published: (2025)

SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention
by: Csordás, Róbert, et al.
Published: (2023)

Optimised Grouped-Query Attention Mechanism for Transformers
by: Chen, Yuang, et al.
Published: (2024)

Benign Overfitting in Single-Head Attention
by: Magen, Roey, et al.
Published: (2024)

The Newton-Muon Optimizer
by: Du, Zhehang, et al.
Published: (2026)

Muon: Training and Trade-offs with Latent Attention and MoE
by: Mehta, Sushant, et al.
Published: (2025)

POME: Post Optimization Model Edit via Muon-style Projection
by: Liu, Yong, et al.
Published: (2025)

Muon with Spectral Guidance: Efficient Optimization for Scientific Machine Learning
by: Lu, Binghang, et al.
Published: (2026)

Interleaved Head Attention
by: Duvvuri, Sai Surya, et al.
Published: (2026)

NorMuon: Making Muon more efficient and scalable
by: Li, Zichong, et al.
Published: (2025)

Muown: Row-Norm Control for Muon Optimization
by: Lion, Kai, et al.
Published: (2026)

Multi-Head Attention as a Source of Catastrophic Forgetting in MoE Transformers
by: Chen, Anrui, et al.
Published: (2026)

Adaptive Head Budgeting for Efficient Multi-Head Attention
by: Faye, Bilal, et al.
Published: (2026)

When and Why Does Unsupervised RL Succeed in Mathematical Reasoning? A Manifold Envelopment Perspective
by: Zhang, Zelin, et al.
Published: (2026)

Amortized Variational Inference: When and Why?
by: Margossian, Charles C., et al.
Published: (2023)

When, Where and Why to Average Weights?
by: Ajroldi, Niccolò, et al.
Published: (2025)

Improved Convergence Rates of Muon Optimizer for Nonconvex Optimization
by: Nagashima, Shuntaro, et al.
Published: (2026)

On the Convergence Analysis of Muon
by: Shen, Wei, et al.
Published: (2025)

Efficient Conditioning Why Pseudo Observation Batch Bayesian Optimization Works When It Does not
by: Nagaswetha, Kumbha, et al.
Published: (2026)

Convergence Bound and Critical Batch Size of Muon Optimizer
by: Sato, Naoki, et al.
Published: (2025)