Library Catalog

Bibliographic Details
Main Authors:	Singh, Sidak Pal; Mobahi, Hossein; Agarwala, Atish; Dauphin, Yann
Format:	Preprint
Published:	2025
Subjects:	Machine Learning; Computation and Language
Online Access:	https://arxiv.org/abs/2502.02407

Similar Items

Neglected Hessian component explains mysteries in Sharpness regularization
by: Dauphin, Yann N., et al.
Published: (2024)

Hallmarks of Optimization Trajectories in Neural Networks: Directional Exploration and Redundancy
by: Singh, Sidak Pal, et al.
Published: (2024)

Rethinking Attention: Exploring Shallow Feed-Forward Neural Networks as an Alternative to Attention Layers in Transformers
by: Bozic, Vukasin, et al.
Published: (2023)

High dimensional theory of two-phase optimizers
by: Agarwala, Atish
Published: (2026)

Per-example gradients: a new frontier for understanding and improving optimizers
by: Roulet, Vincent, et al.
Published: (2025)

Introduction to speech recognition
by: Dauphin, Gabriel
Published: (2024)

Accelerating Neural Network Training Along Sharp and Flat Directions
by: Zakarin, Daniyar, et al.
Published: (2025)

Some Fundamental Aspects about Lipschitz Continuity of Neural Networks
by: Khromov, Grigory, et al.
Published: (2023)

A density estimation perspective on learning from pairwise human preferences
by: Dumoulin, Vincent, et al.
Published: (2023)

High dimensional analysis reveals conservative sharpening and a stochastic edge of stability
by: Agarwala, Atish, et al.
Published: (2024)

Feature learning as alignment: a structural property of gradient descent in non-linear neural networks
by: Beaglehole, Daniel, et al.
Published: (2024)

On the Interplay Between Stepsize Tuning and Progressive Sharpening
by: Roulet, Vincent, et al.
Published: (2023)

What Does It Mean to Be a Transformer? Insights from a Theoretical Hessian Analysis
by: Ormaniec, Weronika, et al.
Published: (2024)

Theoretical characterisation of the Gauss-Newton conditioning in Neural Networks
by: Zhao, Jim, et al.
Published: (2024)

On the Foundations of Shortcut Learning
by: Hermann, Katherine L., et al.
Published: (2023)

To Clip or not to Clip: the Dynamics of SGD with Gradient Clipping in High-Dimensions
by: Marshall, Noah, et al.
Published: (2024)

What do near-optimal learning rate schedules look like?
by: Naganuma, Hiroki, et al.
Published: (2026)

Exact Risk Curves of signSGD in High-Dimensions: Quantifying Preconditioning and Noise-Compression Effects
by: Xiao, Ke Liang, et al.
Published: (2024)

Reasoning Boosts Opinion Alignment in LLMs
by: Berdoz, Frédéric, et al.
Published: (2026)

Seq-VCR: Preventing Collapse in Intermediate Transformer Representations for Enhanced Reasoning
by: Arefin, Md Rifat, et al.
Published: (2024)

Towards Meta-Pruning via Optimal Transport
by: Theus, Alexander, et al.
Published: (2024)

$Q\sharp$: Provably Optimal Distributional RL for LLM Post-Training
by: Zhou, Jin Peng, et al.
Published: (2025)

Scaling Collapse Reveals Universal Dynamics in Compute-Optimally Trained Neural Networks
by: Qiu, Shikai, et al.
Published: (2025)

Mining Mental Health Signals: A Comparative Study of Four Machine Learning Methods for Depression Detection from Social Media Posts in Sorani Kurdish
by: Mohammed, Idrees, et al.
Published: (2025)

Contextual Graph Transformer: A Small Language Model for Enhanced Engineering Document Information Extraction
by: Reddy, Karan, et al.
Published: (2025)

Data-Aware Random Feature Kernel for Transformers
by: Farzam, Amirhossein, et al.
Published: (2026)

Local vs Global continual learning
by: Lanzillotta, Giulia, et al.
Published: (2024)

Agent-Omni: Test-Time Multimodal Reasoning via Model Coordination for Understanding Anything
by: Lin, Huawei, et al.
Published: (2025)

KITE: Kernelized and Information Theoretic Exemplars for In-Context Learning
by: Singh, Vaibhav, et al.
Published: (2025)

FEval-TTC: Fair Evaluation Protocol for Test-Time Compute
by: Rumiantsev, Pavel, et al.
Published: (2025)

LLMs can learn self-restraint through iterative self-reflection
by: Piché, Alexandre, et al.
Published: (2024)

Does Representation Matter? Exploring Intermediate Layers in Large Language Models
by: Skean, Oscar, et al.
Published: (2024)

Exploring Precision and Recall to assess the quality and diversity of LLMs
by: Bronnec, Florian Le, et al.
Published: (2024)

Robustmix: Improving Robustness by Regularizing the Frequency Bias of Deep Nets
by: Ngnawe, Jonas, et al.
Published: (2023)

Stepping on the Edge: Curvature Aware Learning Rate Tuners
by: Roulet, Vincent, et al.
Published: (2024)

A Gauge Theory of Superposition: Toward a Sheaf-Theoretic Atlas of Neural Representations
by: Javidnia, Hossein
Published: (2026)

Semantic Sections: An Atlas-Native Feature Ontology for Obstructed Representation Spaces
by: Javidnia, Hossein
Published: (2026)

Improving Direct Persian-English Speech-to-Speech Translation with Discrete Units and Synthetic Parallel Data
by: Rashidi, Sina, et al.
Published: (2025)

A Comprehensive Approach to Misspelling Correction with BERT and Levenshtein Distance
by: Naziri, Amirreza, et al.
Published: (2024)

Phases of Muon: When Muon Eclipses SignSGD
by: Paquette, Elliot, et al.
Published: (2026)