| Main Authors: | Moalla, Skander, Miele, Andrea, Pyatko, Daniil, Pascanu, Razvan, Gulcehre, Caglar |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2405.00662 |
Similar Items
Building on Efficient Foundations: Effectively Training LLMs with Structured Feedforward Layers
by: Wei, Xiuying, et al.
Published: (2024)
Investigating Low-Rank Training in Transformer Language Models: Efficiency and Scaling Analysis
by: Wei, Xiuying, et al.
Published: (2024)
Quantile Reward Policy Optimization: Alignment with Pointwise Regression and Exact Partition Functions
by: Matrenok, Simon, et al.
Published: (2025)
Universality of Linear Recurrences Followed by Non-linear Projections: Finite-Width Guarantees and Benefits of Complex Eigenvalues
by: Orvieto, Antonio, et al.
Published: (2023)
Adaptive Attacks on Trusted Monitors Subvert AI Control Protocols
by: Terekhov, Mikhail, et al.
Published: (2025)
Unpacking Softmax: How Temperature Drives Representation Collapse, Compression, and Generalization
by: Masarczyk, Wojciech, et al.
Published: (2025)
Latent Space Representations of Neural Algorithmic Reasoners
by: Mirjanić, Vladimir V., et al.
Published: (2023)
HiPPO-Prophecy: State-Space Models can Provably Learn Dynamical Systems in Context
by: Joseph, Federico Arangath, et al.
Published: (2024)
From Markov to Laplace: How Mamba In-Context Learns Markov Chains
by: Bondaschi, Marco, et al.
Published: (2025)
RAT: Bridging RNN Efficiency and Attention Accuracy via Chunk-based Sequence Modeling
by: Wei, Xiuying, et al.
Published: (2025)
Augmenting Attention with Exponentially Decaying Memory Improves Query-Aware KV Sparsity
by: Wei, Xiuying, et al.
Published: (2026)
BlockGen: Flexible Blockwise Sequence Modeling with Hybrid Samplers
by: Deschenaux, Justin, et al.
Published: (2026)
RAT+: Train Dense, Infer Sparse -- Recurrence Augmented Attention for Dilated Inference
by: Wei, Xiuying, et al.
Published: (2026)
Beyond Autoregression: Fast LLMs via Self-Distillation Through Time
by: Deschenaux, Justin, et al.
Published: (2024)
In Search for Architectures and Loss Functions in Multi-Objective Reinforcement Learning
by: Terekhov, Mikhail, et al.
Published: (2024)
Python Machine Learning Research Template
by: Moalla, Skander
Published: (2025)
Partition Generative Modeling: Masked Modeling Without Masks
by: Deschenaux, Justin, et al.
Published: (2025)
The Role of Deep Learning Regularizations on Actors in Offline RL
by: Tarasov, Denis, et al.
Published: (2024)
The Diffusion Duality, Chapter II: $Ψ$-Samplers
by: Deschenaux, Justin, et al.
Published: (2026)
Regret-Optimized Portfolio Enhancement through Deep Reinforcement Learning and Future Looking Rewards
by: Karzanov, Daniil, et al.
Published: (2025)
Deep Grokking: Would Deep Neural Networks Generalize Better?
by: Fan, Simin, et al.
Published: (2024)
Lattice: Learning to Efficiently Compress the Memory
by: Karami, Mahdi, et al.
Published: (2025)
PPO in the Fisher-Rao geometry
by: Lascu, Razvan-Andrei, et al.
Published: (2025)
NoProp: Training Neural Networks without Full Back-propagation or Full Forward-propagation
by: Li, Qinyu, et al.
Published: (2025)
Trust-Region Behavior Blending for On-Policy Distillation
by: Plyusov, Daniil, et al.
Published: (2026)
Meta-learning how to Share Credit among Macro-Actions
by: Hosu, Ionel-Alexandru, et al.
Published: (2025)
Revisiting Adam for Streaming Reinforcement Learning
by: Gogianu, Florin, et al.
Published: (2026)
What Can Grokking Teach Us About Learning Under Nonstationarity?
by: Lyle, Clare, et al.
Published: (2025)
Loopholing Discrete Diffusion: Deterministic Bypass of the Sampling Wall
by: Jo, Mingyu, et al.
Published: (2025)
PPO-BR: Dual-Signal Entropy-Reward Adaptation for Trust Region Policy Optimization
by: Rahman, Ben
Published: (2025)
Simple Hierarchical Planning with Diffusion
by: Chen, Chang, et al.
Published: (2024)
Control Tax: The Price of Keeping AI in Check
by: Terekhov, Mikhail, et al.
Published: (2025)
Softmax is not Enough (for Sharp Size Generalisation)
by: Veličković, Petar, et al.
Published: (2024)
Attention as a Hypernetwork
by: Schug, Simon, et al.
Published: (2024)
MS-SSM: A Multi-Scale State Space Model for Efficient Sequence Modeling
by: Karami, Mahdi, et al.
Published: (2025)
Hadamard product in deep learning: Introduction, Advances and Challenges
by: Chrysos, Grigorios G, et al.
Published: (2025)
Round and Round We Go! What makes Rotary Positional Encodings useful?
by: Barbero, Federico, et al.
Published: (2024)
PlanDQ: Hierarchical Plan Orchestration via D-Conductor and Q-Performer
by: Chen, Chang, et al.
Published: (2024)
Finite-Sample Convergence Bounds for Trust Region Policy Optimization in Mean-Field Games
by: Ocello, Antonio, et al.
Published: (2025)
Perplexity Cannot Always Tell Right from Wrong
by: Veličković, Petar, et al.
Published: (2026)