| Main Authors: | Ajroldi, Niccolò, Orvieto, Antonio, Geiping, Jonas |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2502.06761 |
Similar Items
Can you Finetune your Binoculars? Embedding Text Watermarks into the Weights of Large Language Models
by: Elhassan, Fay, et al.
Published: (2025)
Training Dynamics Impact Post-Training Quantization Robustness
by: Catalan-Tatjer, Albert, et al.
Published: (2025)
Enhancing Optimizer Stability: Momentum Adaptation of The NGN Step-size
by: Islamov, Rustem, et al.
Published: (2025)
Loss Landscape Characterization of Neural Networks without Over-Parametrization
by: Islamov, Rustem, et al.
Published: (2024)
Is your batch size the problem? Revisiting the Adam-SGD gap in language modeling
by: Srećković, Teodora, et al.
Published: (2025)
Deriving Hyperparameter Scaling Laws via Modern Optimization Theory
by: Shulgin, Egor, et al.
Published: (2026)
Fine, I'll Merge It Myself: A Multi-Fidelity Framework for Automated Model Merging
by: Su, Guinan, et al.
Published: (2025)
Efficiently Dispatching Flash Attention For Partially Filled Attention Masks
by: Sharma, Agniv, et al.
Published: (2024)
Adam Simplified: Bias Correction Debunked
by: Laing, Sam, et al.
Published: (2025)
Revisiting associative recall in modern recurrent models
by: Okpekpe, Destiny, et al.
Published: (2025)
An Adaptive Stochastic Gradient Method with Non-negative Gauss-Newton Stepsizes
by: Orvieto, Antonio, et al.
Published: (2024)
Efficient Parallel Samplers for Recurrent-Depth Models and Their Connection to Diffusion Language Models
by: Geiping, Jonas, et al.
Published: (2025)
In Search of Adam's Secret Sauce
by: Orvieto, Antonio, et al.
Published: (2025)
Recurrent neural networks: vanishing and exploding gradients are not the end of the story
by: Zucchet, Nicolas, et al.
Published: (2024)
An Uncertainty Principle for Linear Recurrent Neural Networks
by: François, Alexandre, et al.
Published: (2025)
Explaining Grokking in Transformers through the Lens of Inductive Bias
by: Singh, Jaisidh, et al.
Published: (2026)
Universal Dynamics of Warmup Stable Decay: understanding WSD beyond Transformers
by: Belloni, Annalisa, et al.
Published: (2026)
Improved state mixing in higher-order and block diagonal linear recurrent networks
by: Dubinin, Igor, et al.
Published: (2026)
When Fewer Layers Break More Chains: Layer Pruning Harms Test-Time Scaling in LLMs
by: Wang, Keyu, et al.
Published: (2025)
Sample Smart, Not Hard: Correctness-First Decoding for Better Reasoning in LLMs
by: Li, Xueyan, et al.
Published: (2025)
Multi-Stream LLMs: Unblocking Language Models with Parallel Streams of Thoughts, Inputs and Outputs
by: Su, Guinan, et al.
Published: (2026)
Open-sci-ref-0.01: open and reproducible reference baselines for language model and dataset comparison
by: Nezhurina, Marianna, et al.
Published: (2025)
Pitfalls in Evaluating Language Model Forecasters
by: Paleka, Daniel, et al.
Published: (2025)
NIMBA: Towards Robust and Principled Processing of Point Clouds With SSMs
by: Köprücü, Nursena, et al.
Published: (2024)
On the low-shot transferability of [V]-Mamba
by: Misra, Diganta, et al.
Published: (2024)
Towards Understanding Self-Pretraining for Sequence Classification
by: Coser, Omar, et al.
Published: (2026)
Geometric Inductive Biases of Deep Networks: The Role of Data and Architecture
by: Movahedi, Sajad, et al.
Published: (2024)
Super Consistency of Neural Network Landscapes and Learning Rate Transfer
by: Noci, Lorenzo, et al.
Published: (2024)
Capability-Based Scaling Trends for LLM-Based Red-Teaming
by: Panfilov, Alexander, et al.
Published: (2025)
Scaling Open-Ended Reasoning to Predict the Future
by: Chandak, Nikhil, et al.
Published: (2025)
Unmasking On-Policy Distillation: Where It Helps, Where It Hurts, and Why
by: Armandpour, Mohammadreza, et al.
Published: (2026)
Fixed-Point RNNs: Interpolating from Diagonal to Dense
by: Movahedi, Sajad, et al.
Published: (2025)
Efficient Test-Time Inference via Deterministic Exploration of Truncated Decoding Trees
by: Li, Xueyan, et al.
Published: (2026)
Adaptive Stochastic Weight Averaging
by: Demir, Caglar, et al.
Published: (2024)
When and Why is Optimistic Multiplicative Weights Slow? The Geometry of Energy Dissipation
by: Lazarsfeld, John, et al.
Published: (2026)
Recurrent Distance Filtering for Graph Representation Learning
by: Ding, Yuhui, et al.
Published: (2023)
Answer Matching Outperforms Multiple Choice for Language Model Evaluation
by: Chandak, Nikhil, et al.
Published: (2025)
What do we learn from inverting CLIP models?
by: Kazemi, Hamid, et al.
Published: (2024)
Muown: Row-Norm Control for Muon Optimization
by: Lion, Kai, et al.
Published: (2026)
Sample Weight Averaging for Stable Prediction
by: Yu, Han, et al.
Published: (2025)