| Main Authors: | Singh, Sidak Pal, Mobahi, Hossein, Agarwala, Atish, Dauphin, Yann |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2502.02407 |
Similar Items
Neglected Hessian component explains mysteries in Sharpness regularization
by: Dauphin, Yann N., et al.
Published: (2024)
Hallmarks of Optimization Trajectories in Neural Networks: Directional Exploration and Redundancy
by: Singh, Sidak Pal, et al.
Published: (2024)
Rethinking Attention: Exploring Shallow Feed-Forward Neural Networks as an Alternative to Attention Layers in Transformers
by: Bozic, Vukasin, et al.
Published: (2023)
High dimensional theory of two-phase optimizers
by: Agarwala, Atish
Published: (2026)
Per-example gradients: a new frontier for understanding and improving optimizers
by: Roulet, Vincent, et al.
Published: (2025)
Introduction to speech recognition
by: Dauphin, Gabriel
Published: (2024)
Accelerating Neural Network Training Along Sharp and Flat Directions
by: Zakarin, Daniyar, et al.
Published: (2025)
Some Fundamental Aspects about Lipschitz Continuity of Neural Networks
by: Khromov, Grigory, et al.
Published: (2023)
A density estimation perspective on learning from pairwise human preferences
by: Dumoulin, Vincent, et al.
Published: (2023)
High dimensional analysis reveals conservative sharpening and a stochastic edge of stability
by: Agarwala, Atish, et al.
Published: (2024)
Feature learning as alignment: a structural property of gradient descent in non-linear neural networks
by: Beaglehole, Daniel, et al.
Published: (2024)
On the Interplay Between Stepsize Tuning and Progressive Sharpening
by: Roulet, Vincent, et al.
Published: (2023)
What Does It Mean to Be a Transformer? Insights from a Theoretical Hessian Analysis
by: Ormaniec, Weronika, et al.
Published: (2024)
Theoretical characterisation of the Gauss-Newton conditioning in Neural Networks
by: Zhao, Jim, et al.
Published: (2024)
On the Foundations of Shortcut Learning
by: Hermann, Katherine L., et al.
Published: (2023)
To Clip or not to Clip: the Dynamics of SGD with Gradient Clipping in High-Dimensions
by: Marshall, Noah, et al.
Published: (2024)
What do near-optimal learning rate schedules look like?
by: Naganuma, Hiroki, et al.
Published: (2026)
Exact Risk Curves of signSGD in High-Dimensions: Quantifying Preconditioning and Noise-Compression Effects
by: Xiao, Ke Liang, et al.
Published: (2024)
Reasoning Boosts Opinion Alignment in LLMs
by: Berdoz, Frédéric, et al.
Published: (2026)
Seq-VCR: Preventing Collapse in Intermediate Transformer Representations for Enhanced Reasoning
by: Arefin, Md Rifat, et al.
Published: (2024)
Towards Meta-Pruning via Optimal Transport
by: Theus, Alexander, et al.
Published: (2024)
$Q\sharp$: Provably Optimal Distributional RL for LLM Post-Training
by: Zhou, Jin Peng, et al.
Published: (2025)
Scaling Collapse Reveals Universal Dynamics in Compute-Optimally Trained Neural Networks
by: Qiu, Shikai, et al.
Published: (2025)
Mining Mental Health Signals: A Comparative Study of Four Machine Learning Methods for Depression Detection from Social Media Posts in Sorani Kurdish
by: Mohammed, Idrees, et al.
Published: (2025)
Contextual Graph Transformer: A Small Language Model for Enhanced Engineering Document Information Extraction
by: Reddy, Karan, et al.
Published: (2025)
Data-Aware Random Feature Kernel for Transformers
by: Farzam, Amirhossein, et al.
Published: (2026)
Local vs Global continual learning
by: Lanzillotta, Giulia, et al.
Published: (2024)
Agent-Omni: Test-Time Multimodal Reasoning via Model Coordination for Understanding Anything
by: Lin, Huawei, et al.
Published: (2025)
KITE: Kernelized and Information Theoretic Exemplars for In-Context Learning
by: Singh, Vaibhav, et al.
Published: (2025)
FEval-TTC: Fair Evaluation Protocol for Test-Time Compute
by: Rumiantsev, Pavel, et al.
Published: (2025)
LLMs can learn self-restraint through iterative self-reflection
by: Piché, Alexandre, et al.
Published: (2024)
Does Representation Matter? Exploring Intermediate Layers in Large Language Models
by: Skean, Oscar, et al.
Published: (2024)
Exploring Precision and Recall to assess the quality and diversity of LLMs
by: Bronnec, Florian Le, et al.
Published: (2024)
Robustmix: Improving Robustness by Regularizing the Frequency Bias of Deep Nets
by: Ngnawe, Jonas, et al.
Published: (2023)
Stepping on the Edge: Curvature Aware Learning Rate Tuners
by: Roulet, Vincent, et al.
Published: (2024)
A Gauge Theory of Superposition: Toward a Sheaf-Theoretic Atlas of Neural Representations
by: Javidnia, Hossein
Published: (2026)
Semantic Sections: An Atlas-Native Feature Ontology for Obstructed Representation Spaces
by: Javidnia, Hossein
Published: (2026)
Improving Direct Persian-English Speech-to-Speech Translation with Discrete Units and Synthetic Parallel Data
by: Rashidi, Sina, et al.
Published: (2025)
A Comprehensive Approach to Misspelling Correction with BERT and Levenshtein Distance
by: Naziri, Amirreza, et al.
Published: (2024)
Phases of Muon: When Muon Eclipses SignSGD
by: Paquette, Elliot, et al.
Published: (2026)