| Main Authors: | Morwani, Depen; Vyas, Nikhil; Zhang, Hanlin; Kakade, Sham |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2502.02431 |
Similar Items
The Potential of Second-Order Optimization for LLMs: A Study with Full Gauss-Newton
by: Abreu, Natalie, et al.
Published: (2025)
Deconstructing What Makes a Good Optimizer for Language Models
by: Zhao, Rosie, et al.
Published: (2024)
Anytime Pretraining: Horizon-Free Learning-Rate Schedules with Weight Averaging
by: Meterez, Alexandru, et al.
Published: (2026)
How Does Critical Batch Size Scale in Pre-training?
by: Zhang, Hanlin, et al.
Published: (2024)
Beyond Implicit Bias: The Insignificance of SGD Noise in Online Learning
by: Vyas, Nikhil, et al.
Published: (2023)
Seesaw: Accelerating Training by Balancing Learning Rate and Batch Size Scheduling
by: Meterez, Alexandru, et al.
Published: (2025)
SOAP: Improving and Stabilizing Shampoo using Adam
by: Vyas, Nikhil, et al.
Published: (2024)
Adam or Gauss-Newton? A Comparative Study In Terms of Basis Alignment and SGD Noise
by: Liu, Bingbin, et al.
Published: (2025)
The AdEMAMix Optimizer: Better, Faster, Older
by: Pagliardini, Matteo, et al.
Published: (2024)
A New Perspective on Shampoo's Preconditioner
by: Morwani, Depen, et al.
Published: (2024)
A Simplified Analysis of SGD for Linear Regression with Weight Averaging
by: Meterez, Alexandru, et al.
Published: (2025)
LOTION: Smoothing the Optimization Landscape for Quantized Training
by: Kwun, Mujin, et al.
Published: (2025)
Loss-to-Loss Prediction: Scaling Laws for All Datasets
by: Brandfonbrener, David, et al.
Published: (2024)
Discovering Hierarchical Latent Capabilities of Language Models via Causal Representation Learning
by: Jin, Jikai, et al.
Published: (2025)
Prescriptive Scaling Reveals the Evolution of Language Model Capabilities
by: Zhang, Hanlin, et al.
Published: (2026)
The Recurrent Transformer: Greater Effective Depth and Efficient Decoding
by: Oncescu, Costin-Andrei, et al.
Published: (2026)
Feature emergence via margin maximization: case studies in algebraic tasks
by: Morwani, Depen, et al.
Published: (2023)
CoLoR-Filter: Conditional Loss Reduction Filtering for Targeted Language Model Pre-training
by: Brandfonbrener, David, et al.
Published: (2024)
Follow My Instruction and Spill the Beans: Scalable Data Extraction from Retrieval-Augmented Generation Systems
by: Qi, Zhenting, et al.
Published: (2024)
Learning Hidden Markov Models Using Conditional Samples
by: Kakade, Sham M., et al.
Published: (2023)
Flash Inference: Near Linear Time Inference for Long Convolution Sequence Models and Beyond
by: Oncescu, Costin-Andrei, et al.
Published: (2024)
A Study on the Calibration of In-context Learning
by: Zhang, Hanlin, et al.
Published: (2023)
EvoLM: In Search of Lost Language Model Training Dynamics
by: Qi, Zhenting, et al.
Published: (2025)
Repeat After Me: Transformers are Better than State Space Models at Copying
by: Jelassi, Samy, et al.
Published: (2024)
Soup to go: mitigating forgetting during continual learning with model averaging
by: Kleiman, Anat, et al.
Published: (2025)
Scaling Laws for Imitation Learning in Single-Agent Games
by: Tuyls, Jens, et al.
Published: (2023)
The Role of Sparsity for Length Generalization in Transformers
by: Golowich, Noah, et al.
Published: (2025)
Scaling Laws in Linear Regression: Compute, Parameters, and Data
by: Lin, Licong, et al.
Published: (2024)
Transcendence: Generative Models Can Outperform The Experts That Train Them
by: Zhang, Edwin, et al.
Published: (2024)
Analysis of Schedule-Free Nonconvex Optimization
by: Brown, Connor
Published: (2025)
Preference-Based Multi-Agent Reinforcement Learning: Data Coverage and Algorithmic Techniques
by: Zhang, Natalia, et al.
Published: (2024)
ScheduleFree+: Scaling Learning-Rate-Free & Schedule-Free Learning to Large Language Models
by: Defazio, Aaron
Published: (2026)
Anytime Training with Schedule-Free Spectral Optimization
by: Apte, Anuj, et al.
Published: (2026)
Peer-Predictive Self-Training for Language Model Reasoning
by: Feng, Shi, et al.
Published: (2026)
Accumulative SGD Influence Estimation for Data Attribution
by: Shi, Yunxiao, et al.
Published: (2025)
Deep Learning with Tabular Data: A Self-supervised Approach
by: Vyas, Tirth Kiranbhai
Published: (2024)
Anon: Extrapolating Adaptivity Beyond SGD and Adam
by: Zhang, Yiheng, et al.
Published: (2026)
Distinguishing the Knowable from the Unknowable with Language Models
by: Ahdritz, Gustaf, et al.
Published: (2024)
Observation-Free Attacks on Online Learning to Rank
by: Chattopadhyay, Sameep, et al.
Published: (2025)
Bootstrap SGD: Algorithmic Stability and Robustness
by: Christmann, Andreas, et al.
Published: (2024)