:: Library Catalog

Bibliographic Details
Main Authors:	Moalla, Skander; Miele, Andrea; Pyatko, Daniil; Pascanu, Razvan; Gulcehre, Caglar
Format:	Preprint
Published:	2024
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2405.00662

Similar Items

Building on Efficient Foundations: Effectively Training LLMs with Structured Feedforward Layers
by: Wei, Xiuying, et al.
Published: (2024)

Investigating Low-Rank Training in Transformer Language Models: Efficiency and Scaling Analysis
by: Wei, Xiuying, et al.
Published: (2024)

Quantile Reward Policy Optimization: Alignment with Pointwise Regression and Exact Partition Functions
by: Matrenok, Simon, et al.
Published: (2025)

Universality of Linear Recurrences Followed by Non-linear Projections: Finite-Width Guarantees and Benefits of Complex Eigenvalues
by: Orvieto, Antonio, et al.
Published: (2023)

Adaptive Attacks on Trusted Monitors Subvert AI Control Protocols
by: Terekhov, Mikhail, et al.
Published: (2025)

Unpacking Softmax: How Temperature Drives Representation Collapse, Compression, and Generalization
by: Masarczyk, Wojciech, et al.
Published: (2025)

Latent Space Representations of Neural Algorithmic Reasoners
by: Mirjanić, Vladimir V., et al.
Published: (2023)

HiPPO-Prophecy: State-Space Models can Provably Learn Dynamical Systems in Context
by: Joseph, Federico Arangath, et al.
Published: (2024)

From Markov to Laplace: How Mamba In-Context Learns Markov Chains
by: Bondaschi, Marco, et al.
Published: (2025)

RAT: Bridging RNN Efficiency and Attention Accuracy via Chunk-based Sequence Modeling
by: Wei, Xiuying, et al.
Published: (2025)

Augmenting Attention with Exponentially Decaying Memory Improves Query-Aware KV Sparsity
by: Wei, Xiuying, et al.
Published: (2026)

BlockGen: Flexible Blockwise Sequence Modeling with Hybrid Samplers
by: Deschenaux, Justin, et al.
Published: (2026)

RAT+: Train Dense, Infer Sparse -- Recurrence Augmented Attention for Dilated Inference
by: Wei, Xiuying, et al.
Published: (2026)

Beyond Autoregression: Fast LLMs via Self-Distillation Through Time
by: Deschenaux, Justin, et al.
Published: (2024)

In Search for Architectures and Loss Functions in Multi-Objective Reinforcement Learning
by: Terekhov, Mikhail, et al.
Published: (2024)

Python Machine Learning Research Template
by: Moalla, Skander
Published: (2025)

Partition Generative Modeling: Masked Modeling Without Masks
by: Deschenaux, Justin, et al.
Published: (2025)

The Role of Deep Learning Regularizations on Actors in Offline RL
by: Tarasov, Denis, et al.
Published: (2024)

The Diffusion Duality, Chapter II: $Ψ$-Samplers
by: Deschenaux, Justin, et al.
Published: (2026)

Regret-Optimized Portfolio Enhancement through Deep Reinforcement Learning and Future Looking Rewards
by: Karzanov, Daniil, et al.
Published: (2025)

Deep Grokking: Would Deep Neural Networks Generalize Better?
by: Fan, Simin, et al.
Published: (2024)

Lattice: Learning to Efficiently Compress the Memory
by: Karami, Mahdi, et al.
Published: (2025)

PPO in the Fisher-Rao geometry
by: Lascu, Razvan-Andrei, et al.
Published: (2025)

NoProp: Training Neural Networks without Full Back-propagation or Full Forward-propagation
by: Li, Qinyu, et al.
Published: (2025)

Trust-Region Behavior Blending for On-Policy Distillation
by: Plyusov, Daniil, et al.
Published: (2026)

Meta-learning how to Share Credit among Macro-Actions
by: Hosu, Ionel-Alexandru, et al.
Published: (2025)

Revisiting Adam for Streaming Reinforcement Learning
by: Gogianu, Florin, et al.
Published: (2026)

What Can Grokking Teach Us About Learning Under Nonstationarity?
by: Lyle, Clare, et al.
Published: (2025)

Loopholing Discrete Diffusion: Deterministic Bypass of the Sampling Wall
by: Jo, Mingyu, et al.
Published: (2025)

PPO-BR: Dual-Signal Entropy-Reward Adaptation for Trust Region Policy Optimization
by: Rahman, Ben
Published: (2025)

Simple Hierarchical Planning with Diffusion
by: Chen, Chang, et al.
Published: (2024)

Control Tax: The Price of Keeping AI in Check
by: Terekhov, Mikhail, et al.
Published: (2025)

Softmax is not Enough (for Sharp Size Generalisation)
by: Veličković, Petar, et al.
Published: (2024)

Attention as a Hypernetwork
by: Schug, Simon, et al.
Published: (2024)

MS-SSM: A Multi-Scale State Space Model for Efficient Sequence Modeling
by: Karami, Mahdi, et al.
Published: (2025)

Hadamard product in deep learning: Introduction, Advances and Challenges
by: Chrysos, Grigorios G, et al.
Published: (2025)

Round and Round We Go! What makes Rotary Positional Encodings useful?
by: Barbero, Federico, et al.
Published: (2024)

PlanDQ: Hierarchical Plan Orchestration via D-Conductor and Q-Performer
by: Chen, Chang, et al.
Published: (2024)

Finite-Sample Convergence Bounds for Trust Region Policy Optimization in Mean-Field Games
by: Ocello, Antonio, et al.
Published: (2025)

Perplexity Cannot Always Tell Right from Wrong
by: Veličković, Petar, et al.
Published: (2026)