| Main Authors: | Soligo, Anna; Ferraro, Pietro; Boyle, David |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2501.17077 |
Similar Items
Probabilistic Constrained Reinforcement Learning with Formal Interpretability
by: Wang, Yanran, et al.
Published: (2023)
Convergent Linear Representations of Emergent Misalignment
by: Soligo, Anna, et al.
Published: (2025)
Model Organisms for Emergent Misalignment
by: Turner, Edward, et al.
Published: (2025)
Interpretable Hierarchical Concept Reasoning through Attention-Guided Graph Learning
by: Debot, David, et al.
Published: (2025)
CALF: Communication-Aware Learning Framework for Distributed Reinforcement Learning
by: Purves, Carlos, et al.
Published: (2026)
Neural Network Conversion of Machine Learning Pipelines
by: Sung, Man-Ling, et al.
Published: (2026)
Neural Interpretable Reasoning
by: Barbiero, Pietro, et al.
Published: (2025)
Interpreting Emergent Planning in Model-Free Reinforcement Learning
by: Bush, Thomas, et al.
Published: (2025)
Actionable Interpretability via Causal Hypergraphs: Unravelling Batch Size Effects in Deep Learning
by: Sun, Zhongtian, et al.
Published: (2025)
Three Dogmas of Reinforcement Learning
by: Abel, David, et al.
Published: (2024)
IR$^3$: Contrastive Inverse Reinforcement Learning for Interpretable Detection and Mitigation of Reward Hacking
by: Beigi, Mohammad, et al.
Published: (2026)
Interpretable Neural-Symbolic Concept Reasoning
by: Barbiero, Pietro, et al.
Published: (2023)
Interpretable Deep Reinforcement Learning for Element-level Bridge Life-cycle Optimization
by: Moayyedi, Seyyed Amirhossein, et al.
Published: (2026)
Towards Interpretable Deep Reinforcement Learning Models via Inverse Reinforcement Learning
by: Xie, Sean, et al.
Published: (2022)
Surrogate Fitness Metrics for Interpretable Reinforcement Learning
by: Altmann, Philipp, et al.
Published: (2025)
Interpretable Concept-Based Memory Reasoning
by: Debot, David, et al.
Published: (2024)
Interpretability by Design for Efficient Multi-Objective Reinforcement Learning
by: Xia, Qiyue, et al.
Published: (2025)
Evaluating Interpretable Reinforcement Learning by Distilling Policies into Programs
by: Kohler, Hector, et al.
Published: (2025)
Interpretable and Editable Programmatic Tree Policies for Reinforcement Learning
by: Kohler, Hector, et al.
Published: (2024)
The Interpretability of Codebooks in Model-Based Reinforcement Learning is Limited
by: Eaton, Kenneth, et al.
Published: (2024)
Safety Modulation: Enhancing Safety in Reinforcement Learning through Cost-Modulated Rewards
by: Zhang, Hanping, et al.
Published: (2025)
LICORICE: Label-Efficient Concept-Based Interpretable Reinforcement Learning
by: Ye, Zhuorui, et al.
Published: (2024)
Towards Interpretable Reinforcement Learning with Constrained Normalizing Flow Policies
by: Rietz, Finn, et al.
Published: (2024)
Compositional Function Networks: A High-Performance Alternative to Deep Neural Networks with Built-in Interpretability
by: Li, Fang
Published: (2025)
Data-Centric Interpretability for LLM-based Multi-Agent Reinforcement Learning
by: Yan, John, et al.
Published: (2026)
Prism: Policy Reuse via Interpretable Strategy Mapping in Reinforcement Learning
by: Pravetz, Thomas
Published: (2026)
Three Pathways to Neurosymbolic Reinforcement Learning with Interpretable Model and Policy Networks
by: Graf, Peter, et al.
Published: (2024)
On Leakage in Machine Learning Pipelines
by: Sasse, Leonard, et al.
Published: (2023)
NSF-MAP: Neurosymbolic Multimodal Fusion for Robust and Interpretable Anomaly Prediction in Assembly Pipelines
by: Shyalika, Chathurangi, et al.
Published: (2025)
Neural Approaches to SAT Solving: Design Choices and Interpretability
by: Mojžíšek, David, et al.
Published: (2025)
A Reproducible Log-Driven AutoML Framework for Interpretable Pipeline Optimization in Healthcare Risk Prediction
by: Huang, Rui, et al.
Published: (2026)
Learning Interpretable Policies in Hindsight-Observable POMDPs through Partially Supervised Reinforcement Learning
by: Lanier, Michael, et al.
Published: (2024)
Metric Learning for Clifford Group Equivariant Neural Networks
by: Ali, Riccardo, et al.
Published: (2024)
Neural Lyapunov Function Approximation with Self-Supervised Reinforcement Learning
by: McCutcheon, Luc, et al.
Published: (2025)
Towards Mechanistic Interpretability of Graph Transformers via Attention Graphs
by: El, Batu, et al.
Published: (2025)
Distributed Multi-Agent Reinforcement Learning Based on Graph-Induced Local Value Functions
by: Jing, Gangshan, et al.
Published: (2022)
From Explainability to Interpretability: Interpretable Policies in Reinforcement Learning Via Model Explanation
by: Li, Peilang, et al.
Published: (2025)
Superposition in Graph Neural Networks
by: Pertl, Lukas, et al.
Published: (2025)
Revisiting Plasticity in Visual Reinforcement Learning: Data, Modules and Training Stages
by: Ma, Guozheng, et al.
Published: (2023)
Towards Automated Semantic Interpretability in Reinforcement Learning via Vision-Language Models
by: Li, Zhaoxin, et al.
Published: (2025)