:: Library Catalog

Bibliographic Details
Main Authors:	Bush, Thomas; Chung, Stephen; Anwar, Usman; Garriga-Alonso, Adrià; Krueger, David
Format:	Preprint
Published:	2025
Subjects:	Machine Learning; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2504.01871

Similar Items

SynthSAEBench: Evaluating Sparse Autoencoders on Scalable Realistic Synthetic Data
by: Chanin, David, et al.
Published: (2026)

Sparse but Wrong: Incorrect L0 Leads to Incorrect Features in Sparse Autoencoders
by: Chanin, David, et al.
Published: (2025)

Path Channels and Plan Extension Kernels: a Mechanistic Description of Planning in a Sokoban RNN
by: Taufeeque, Mohammad, et al.
Published: (2025)

Among Us: A Sandbox for Measuring and Detecting Agentic Deception
by: Golechha, Satvik, et al.
Published: (2025)

Feature Hedging: Correlated Features Break Narrow Sparse Autoencoders
by: Chanin, David, et al.
Published: (2025)

Biases in the Blind Spot: Detecting What LLMs Fail to Mention
by: Arcuschin, Iván, et al.
Published: (2026)

Planning in a recurrent neural network that plays Sokoban
by: Taufeeque, Mohammad, et al.
Published: (2024)

Learning to Forget using Hypernetworks
by: Rangel, Jose Miguel Lara, et al.
Published: (2024)

DiFR: Inference Verification Despite Nondeterminism
by: Karvonen, Adam, et al.
Published: (2025)

Hypothesis Testing the Circuit Hypothesis in LLMs
by: Shi, Claudia, et al.
Published: (2024)

Robust Model-Based Reinforcement Learning with an Adversarial Auxiliary Model
by: Herremans, Siemen, et al.
Published: (2024)

InterpBench: Semi-Synthetic Transformers for Evaluating Mechanistic Interpretability Techniques
by: Gupta, Rohan, et al.
Published: (2024)

Investigating the Indirect Object Identification circuit in Mamba
by: Ensign, Danielle, et al.
Published: (2024)

Catastrophic Goodhart: regularizing RLHF with KL divergence does not mitigate heavy-tailed reward misspecification
by: Kwa, Thomas, et al.
Published: (2024)

Learning from Failures in Multi-Attempt Reinforcement Learning
by: Chung, Stephen, et al.
Published: (2025)

Wavelet-Enhanced Neural ODE and Graph Attention for Interpretable Energy Forecasting
by: Joy, Usman Gani
Published: (2025)

Evaluating Robustness of Reinforcement Learning Algorithms for Autonomous Shipping
by: Lesy, Bavo, et al.
Published: (2024)

Surrogate Fitness Metrics for Interpretable Reinforcement Learning
by: Altmann, Philipp, et al.
Published: (2025)

Reward Model Ensembles Help Mitigate Overoptimization
by: Coste, Thomas, et al.
Published: (2023)

Demystifying MuZero Planning: Interpreting the Learned Model
by: Guei, Hung, et al.
Published: (2024)

Towards Interpretable Deep Reinforcement Learning Models via Inverse Reinforcement Learning
by: Xie, Sean, et al.
Published: (2022)

Prism: Policy Reuse via Interpretable Strategy Mapping in Reinforcement Learning
by: Pravetz, Thomas
Published: (2026)

Offline Reinforcement Learning with Universal Horizon Models
by: Chung, Hojun, et al.
Published: (2026)

Parseval Regularization for Continual Reinforcement Learning
by: Chung, Wesley, et al.
Published: (2024)

The Interpretability of Codebooks in Model-Based Reinforcement Learning is Limited
by: Eaton, Kenneth, et al.
Published: (2024)

Decoding Answers Before Chain-of-Thought: Evidence from Pre-CoT Probes and Activation Steering
by: Cox, Kyle, et al.
Published: (2026)

Continual Reinforcement Learning by Planning with Online World Models
by: Liu, Zichen, et al.
Published: (2025)

Gradient Free Deep Reinforcement Learning With TabPFN
by: Schiff, David, et al.
Published: (2025)

Deep Reinforcement Learning for Traffic Light Control in Intelligent Transportation Systems
by: Zhu, Ming, et al.
Published: (2023)

Handling Delay in Real-Time Reinforcement Learning
by: Anokhin, Ivan, et al.
Published: (2025)

Towards General-Purpose Model-Free Reinforcement Learning
by: Fujimoto, Scott, et al.
Published: (2025)

GOPlan: Goal-conditioned Offline Reinforcement Learning by Planning with Learned Models
by: Wang, Mianchu, et al.
Published: (2023)

Interpretable Deep Reinforcement Learning for Element-level Bridge Life-cycle Optimization
by: Moayyedi, Seyyed Amirhossein, et al.
Published: (2026)

Three Pathways to Neurosymbolic Reinforcement Learning with Interpretable Model and Policy Networks
by: Graf, Peter, et al.
Published: (2024)

Distilling Reinforcement Learning Algorithms for In-Context Model-Based Planning
by: Son, Jaehyeon, et al.
Published: (2025)

Inducing, Detecting and Characterising Neural Modules: A Pipeline for Functional Interpretability in Reinforcement Learning
by: Soligo, Anna, et al.
Published: (2025)

From Explainability to Interpretability: Interpretable Policies in Reinforcement Learning Via Model Explanation
by: Li, Peilang, et al.
Published: (2025)

Model-Free Robust Reinforcement Learning with Sample Complexity Analysis
by: Wang, Yudan, et al.
Published: (2024)

Label-Free Reinforcement Learning via Cross-Model Entropy
by: Gorbett, Matt, et al.
Published: (2026)

Towards Automated Semantic Interpretability in Reinforcement Learning via Vision-Language Models
by: Li, Zhaoxin, et al.
Published: (2025)