Library Catalog

Bibliographic Details
Main Authors:	Ajroldi, Niccolò; Orvieto, Antonio; Geiping, Jonas
Format:	Preprint
Published:	2025
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2502.06761

Similar Items

Can you Finetune your Binoculars? Embedding Text Watermarks into the Weights of Large Language Models
by: Elhassan, Fay, et al.
Published: (2025)

Training Dynamics Impact Post-Training Quantization Robustness
by: Catalan-Tatjer, Albert, et al.
Published: (2025)

Enhancing Optimizer Stability: Momentum Adaptation of The NGN Step-size
by: Islamov, Rustem, et al.
Published: (2025)

Loss Landscape Characterization of Neural Networks without Over-Parametrization
by: Islamov, Rustem, et al.
Published: (2024)

Is your batch size the problem? Revisiting the Adam-SGD gap in language modeling
by: Srećković, Teodora, et al.
Published: (2025)

Deriving Hyperparameter Scaling Laws via Modern Optimization Theory
by: Shulgin, Egor, et al.
Published: (2026)

Fine, I'll Merge It Myself: A Multi-Fidelity Framework for Automated Model Merging
by: Su, Guinan, et al.
Published: (2025)

Efficiently Dispatching Flash Attention For Partially Filled Attention Masks
by: Sharma, Agniv, et al.
Published: (2024)

Adam Simplified: Bias Correction Debunked
by: Laing, Sam, et al.
Published: (2025)

Revisiting associative recall in modern recurrent models
by: Okpekpe, Destiny, et al.
Published: (2025)

An Adaptive Stochastic Gradient Method with Non-negative Gauss-Newton Stepsizes
by: Orvieto, Antonio, et al.
Published: (2024)

Efficient Parallel Samplers for Recurrent-Depth Models and Their Connection to Diffusion Language Models
by: Geiping, Jonas, et al.
Published: (2025)

In Search of Adam's Secret Sauce
by: Orvieto, Antonio, et al.
Published: (2025)

Recurrent neural networks: vanishing and exploding gradients are not the end of the story
by: Zucchet, Nicolas, et al.
Published: (2024)

An Uncertainty Principle for Linear Recurrent Neural Networks
by: François, Alexandre, et al.
Published: (2025)

Explaining Grokking in Transformers through the Lens of Inductive Bias
by: Singh, Jaisidh, et al.
Published: (2026)

Universal Dynamics of Warmup Stable Decay: understanding WSD beyond Transformers
by: Belloni, Annalisa, et al.
Published: (2026)

Improved state mixing in higher-order and block diagonal linear recurrent networks
by: Dubinin, Igor, et al.
Published: (2026)

When Fewer Layers Break More Chains: Layer Pruning Harms Test-Time Scaling in LLMs
by: Wang, Keyu, et al.
Published: (2025)

Sample Smart, Not Hard: Correctness-First Decoding for Better Reasoning in LLMs
by: Li, Xueyan, et al.
Published: (2025)

Multi-Stream LLMs: Unblocking Language Models with Parallel Streams of Thoughts, Inputs and Outputs
by: Su, Guinan, et al.
Published: (2026)

Open-sci-ref-0.01: open and reproducible reference baselines for language model and dataset comparison
by: Nezhurina, Marianna, et al.
Published: (2025)

Pitfalls in Evaluating Language Model Forecasters
by: Paleka, Daniel, et al.
Published: (2025)

NIMBA: Towards Robust and Principled Processing of Point Clouds With SSMs
by: Köprücü, Nursena, et al.
Published: (2024)

On the low-shot transferability of [V]-Mamba
by: Misra, Diganta, et al.
Published: (2024)

Towards Understanding Self-Pretraining for Sequence Classification
by: Coser, Omar, et al.
Published: (2026)

Geometric Inductive Biases of Deep Networks: The Role of Data and Architecture
by: Movahedi, Sajad, et al.
Published: (2024)

Super Consistency of Neural Network Landscapes and Learning Rate Transfer
by: Noci, Lorenzo, et al.
Published: (2024)

Capability-Based Scaling Trends for LLM-Based Red-Teaming
by: Panfilov, Alexander, et al.
Published: (2025)

Scaling Open-Ended Reasoning to Predict the Future
by: Chandak, Nikhil, et al.
Published: (2025)

Unmasking On-Policy Distillation: Where It Helps, Where It Hurts, and Why
by: Armandpour, Mohammadreza, et al.
Published: (2026)

Fixed-Point RNNs: Interpolating from Diagonal to Dense
by: Movahedi, Sajad, et al.
Published: (2025)

Efficient Test-Time Inference via Deterministic Exploration of Truncated Decoding Trees
by: Li, Xueyan, et al.
Published: (2026)

Adaptive Stochastic Weight Averaging
by: Demir, Caglar, et al.
Published: (2024)

When and Why is Optimistic Multiplicative Weights Slow? The Geometry of Energy Dissipation
by: Lazarsfeld, John, et al.
Published: (2026)

Recurrent Distance Filtering for Graph Representation Learning
by: Ding, Yuhui, et al.
Published: (2023)

Answer Matching Outperforms Multiple Choice for Language Model Evaluation
by: Chandak, Nikhil, et al.
Published: (2025)

What do we learn from inverting CLIP models?
by: Kazemi, Hamid, et al.
Published: (2024)

Muown: Row-Norm Control for Muon Optimization
by: Lion, Kai, et al.
Published: (2026)

Sample Weight Averaging for Stable Prediction
by: Yu, Han, et al.
Published: (2025)