| Main Authors: | Bush, Thomas, Chung, Stephen, Anwar, Usman, Garriga-Alonso, Adrià, Krueger, David |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2504.01871 |
Similar Items
SynthSAEBench: Evaluating Sparse Autoencoders on Scalable Realistic Synthetic Data
by: Chanin, David, et al.
Published: (2026)
Sparse but Wrong: Incorrect L0 Leads to Incorrect Features in Sparse Autoencoders
by: Chanin, David, et al.
Published: (2025)
Path Channels and Plan Extension Kernels: a Mechanistic Description of Planning in a Sokoban RNN
by: Taufeeque, Mohammad, et al.
Published: (2025)
Among Us: A Sandbox for Measuring and Detecting Agentic Deception
by: Golechha, Satvik, et al.
Published: (2025)
Feature Hedging: Correlated Features Break Narrow Sparse Autoencoders
by: Chanin, David, et al.
Published: (2025)
Biases in the Blind Spot: Detecting What LLMs Fail to Mention
by: Arcuschin, Iván, et al.
Published: (2026)
Planning in a recurrent neural network that plays Sokoban
by: Taufeeque, Mohammad, et al.
Published: (2024)
Learning to Forget using Hypernetworks
by: Rangel, Jose Miguel Lara, et al.
Published: (2024)
DiFR: Inference Verification Despite Nondeterminism
by: Karvonen, Adam, et al.
Published: (2025)
Hypothesis Testing the Circuit Hypothesis in LLMs
by: Shi, Claudia, et al.
Published: (2024)
Robust Model-Based Reinforcement Learning with an Adversarial Auxiliary Model
by: Herremans, Siemen, et al.
Published: (2024)
InterpBench: Semi-Synthetic Transformers for Evaluating Mechanistic Interpretability Techniques
by: Gupta, Rohan, et al.
Published: (2024)
Investigating the Indirect Object Identification circuit in Mamba
by: Ensign, Danielle, et al.
Published: (2024)
Catastrophic Goodhart: regularizing RLHF with KL divergence does not mitigate heavy-tailed reward misspecification
by: Kwa, Thomas, et al.
Published: (2024)
Learning from Failures in Multi-Attempt Reinforcement Learning
by: Chung, Stephen, et al.
Published: (2025)
Wavelet-Enhanced Neural ODE and Graph Attention for Interpretable Energy Forecasting
by: Joy, Usman Gani
Published: (2025)
Evaluating Robustness of Reinforcement Learning Algorithms for Autonomous Shipping
by: Lesy, Bavo, et al.
Published: (2024)
Surrogate Fitness Metrics for Interpretable Reinforcement Learning
by: Altmann, Philipp, et al.
Published: (2025)
Reward Model Ensembles Help Mitigate Overoptimization
by: Coste, Thomas, et al.
Published: (2023)
Demystifying MuZero Planning: Interpreting the Learned Model
by: Guei, Hung, et al.
Published: (2024)
Towards Interpretable Deep Reinforcement Learning Models via Inverse Reinforcement Learning
by: Xie, Sean, et al.
Published: (2022)
Prism: Policy Reuse via Interpretable Strategy Mapping in Reinforcement Learning
by: Pravetz, Thomas
Published: (2026)
Offline Reinforcement Learning with Universal Horizon Models
by: Chung, Hojun, et al.
Published: (2026)
Parseval Regularization for Continual Reinforcement Learning
by: Chung, Wesley, et al.
Published: (2024)
The Interpretability of Codebooks in Model-Based Reinforcement Learning is Limited
by: Eaton, Kenneth, et al.
Published: (2024)
Decoding Answers Before Chain-of-Thought: Evidence from Pre-CoT Probes and Activation Steering
by: Cox, Kyle, et al.
Published: (2026)
Continual Reinforcement Learning by Planning with Online World Models
by: Liu, Zichen, et al.
Published: (2025)
Gradient Free Deep Reinforcement Learning With TabPFN
by: Schiff, David, et al.
Published: (2025)
Deep Reinforcement Learning for Traffic Light Control in Intelligent Transportation Systems
by: Zhu, Ming, et al.
Published: (2023)
Handling Delay in Real-Time Reinforcement Learning
by: Anokhin, Ivan, et al.
Published: (2025)
Towards General-Purpose Model-Free Reinforcement Learning
by: Fujimoto, Scott, et al.
Published: (2025)
GOPlan: Goal-conditioned Offline Reinforcement Learning by Planning with Learned Models
by: Wang, Mianchu, et al.
Published: (2023)
Interpretable Deep Reinforcement Learning for Element-level Bridge Life-cycle Optimization
by: Moayyedi, Seyyed Amirhossein, et al.
Published: (2026)
Three Pathways to Neurosymbolic Reinforcement Learning with Interpretable Model and Policy Networks
by: Graf, Peter, et al.
Published: (2024)
Distilling Reinforcement Learning Algorithms for In-Context Model-Based Planning
by: Son, Jaehyeon, et al.
Published: (2025)
Inducing, Detecting and Characterising Neural Modules: A Pipeline for Functional Interpretability in Reinforcement Learning
by: Soligo, Anna, et al.
Published: (2025)
From Explainability to Interpretability: Interpretable Policies in Reinforcement Learning Via Model Explanation
by: Li, Peilang, et al.
Published: (2025)
Model-Free Robust Reinforcement Learning with Sample Complexity Analysis
by: Wang, Yudan, et al.
Published: (2024)
Label-Free Reinforcement Learning via Cross-Model Entropy
by: Gorbett, Matt, et al.
Published: (2026)
Towards Automated Semantic Interpretability in Reinforcement Learning via Vision-Language Models
by: Li, Zhaoxin, et al.
Published: (2025)