| Main Author: | Lingle, Lucas |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2404.05728 |
Similar Items
Transformer-VQ: Linear-Time Transformers via Vector Quantization
by: Lingle, Lucas D.
Published: (2023)
A Proof of Learning Rate Transfer under $μ$P
by: Hayou, Soufiane
Published: (2025)
Arithmetic-Mean $μ$P for Modern Architectures: A Unified Learning-Rate Scale for CNNs and ResNets
by: Zhang, Haosong, et al.
Published: (2025)
An Empirical Study of Scaling Laws for Transfer
by: Barnett, Matthew
Published: (2024)
An Empirical Study on Ensemble-Based Transfer Learning Bayesian Optimisation with Mixed Variable Types
by: Trinkle, Natasha, et al.
Published: (2026)
Weight Decay may matter more than muP for Learning Rate Transfer in Practice
by: Kosson, Atli, et al.
Published: (2025)
Learning Rate Transfer in Normalized Transformers
by: Shigida, Boris, et al.
Published: (2026)
Universal Rates of Empirical Risk Minimization
by: Hanneke, Steve, et al.
Published: (2024)
Multi-Task Learning for Metal Alloy Property Prediction: An Empirical Study of Negative Transfer and Mitigation Strategies
by: Kang, Sungwoo
Published: (2025)
Black-box Adversarial Transferability: An Empirical Study in Cybersecurity Perspective
by: Roshan, Khushnaseeb, et al.
Published: (2024)
u-$μ$P: The Unit-Scaled Maximal Update Parametrization
by: Blake, Charlie, et al.
Published: (2024)
Spectral Condition for $μ$P under Width-Depth Scaling
by: Zheng, Chenyu, et al.
Published: (2026)
Super Consistency of Neural Network Landscapes and Learning Rate Transfer
by: Noci, Lorenzo, et al.
Published: (2024)
Optimizers Performance is Task-Dependent: An Empirical Study of Learning Rate Sensitivity in Classification and Regression Tasks
by: Chisom Ruth Chibuike, et al.
Published: (2026)
Empirical Comparison of Membership Inference Attacks in Deep Transfer Learning
by: Bai, Yuxuan, et al.
Published: (2025)
The lazy (NTK) and rich ($μ$P) regimes: a gentle tutorial
by: Karkada, Dhruva
Published: (2024)
Extending $μ$P: Spectral Conditions for Feature Learning Across Optimizers
by: Gupta, Akshita, et al.
Published: (2026)
How Reasoning Evolves from Post-Training Data: An Empirical Study Using Chess
by: Dionisopoulos, Lucas, et al.
Published: (2026)
Sensitivity of Stability: Theoretical & Empirical Analysis of Replicability for Adaptive Data Selection in Transfer Learning
by: Singh, Prabhav, et al.
Published: (2025)
Understanding the Generalization of In-Context Learning in Transformers: An Empirical Study
by: Zhang, Xingxuan, et al.
Published: (2025)
An Empirical Study of Self-supervised Learning with Wasserstein Distance
by: Yamada, Makoto, et al.
Published: (2023)
μP$^2$: Effective Sharpness Aware Minimization Requires Layerwise Perturbation Scaling
by: Haas, Moritz, et al.
Published: (2024)
GQA-μP: The maximal parameterization update for grouped query attention
by: Chickering, Kyle R., et al.
Published: (2026)
An Empirical Study of Aegis
by: Saragih, Daniel, et al.
Published: (2024)
$μ$LO: Compute-Efficient Meta-Generalization of Learned Optimizers
by: Thérien, Benjamin, et al.
Published: (2024)
An Empirical Study of Federated Prompt Learning for Vision Language Model
by: Wang, Zhihao, et al.
Published: (2025)
Matched-Learning-Rate Analysis of Attention Drift and Transfer Retention in Fine-Tuned CLIP
by: Xia, Ruize
Published: (2026)
Improving Knowledge Distillation in Transfer Learning with Layer-wise Learning Rates
by: Kokane, Shirley, et al.
Published: (2024)
Scaling Diffusion Transformers Efficiently via $μ$P
by: Zheng, Chenyu, et al.
Published: (2025)
$μ$-Parametrization for Mixture of Experts
by: Małaśnicki, Jan, et al.
Published: (2025)
Time Transfer: On Optimal Learning Rate and Batch Size In The Infinite Data Limit
by: Filatov, Oleg, et al.
Published: (2024)
When Active Learning Falls Short: An Empirical Study on Chemical Reaction Extraction
by: Yu, Simin, et al.
Published: (2026)
An Empirical Study of Qwen3 Quantization
by: Zheng, Xingyu, et al.
Published: (2025)
Towards Enhancing the Reproducibility of Deep Learning Bugs: An Empirical Study
by: Shah, Mehil B., et al.
Published: (2024)
Enhancing Two-Player Performance Through Single-Player Knowledge Transfer: An Empirical Study on Atari 2600 Games
by: Saadat, Kimiya, et al.
Published: (2024)
Quantifying Hyperparameter Transfer and the Importance of Embedding Layer Learning Rate
by: Kalra, Dayal Singh, et al.
Published: (2026)
Lag Selection for Univariate Time Series Forecasting using Deep Learning: An Empirical Study
by: Leites, José, et al.
Published: (2024)
An Empirical Study of Fault Localisation Techniques for Deep Learning
by: Humbatova, Nargiz, et al.
Published: (2024)
Towards Transfer Unlearning: Empirical Evidence of Cross-Domain Bias Mitigation
by: Lu, Huimin, et al.
Published: (2024)
Global Convergence and Rich Feature Learning in $L$-Layer Infinite-Width Neural Networks under $μ$P Parametrization
by: Chen, Zixiang, et al.
Published: (2025)