:: Library Catalog

Bibliographic Details
Main Authors:	Soligo, Anna; Ferraro, Pietro; Boyle, David
Format:	Preprint
Published:	2025
Subjects:	Machine Learning; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2501.17077

Similar Items

Probabilistic Constrained Reinforcement Learning with Formal Interpretability
by: Wang, Yanran, et al.
Published: (2023)

Convergent Linear Representations of Emergent Misalignment
by: Soligo, Anna, et al.
Published: (2025)

Model Organisms for Emergent Misalignment
by: Turner, Edward, et al.
Published: (2025)

Interpretable Hierarchical Concept Reasoning through Attention-Guided Graph Learning
by: Debot, David, et al.
Published: (2025)

CALF: Communication-Aware Learning Framework for Distributed Reinforcement Learning
by: Purves, Carlos, et al.
Published: (2026)

Neural Network Conversion of Machine Learning Pipelines
by: Sung, Man-Ling, et al.
Published: (2026)

Neural Interpretable Reasoning
by: Barbiero, Pietro, et al.
Published: (2025)

Interpreting Emergent Planning in Model-Free Reinforcement Learning
by: Bush, Thomas, et al.
Published: (2025)

Actionable Interpretability via Causal Hypergraphs: Unravelling Batch Size Effects in Deep Learning
by: Sun, Zhongtian, et al.
Published: (2025)

Three Dogmas of Reinforcement Learning
by: Abel, David, et al.
Published: (2024)

IR$^3$: Contrastive Inverse Reinforcement Learning for Interpretable Detection and Mitigation of Reward Hacking
by: Beigi, Mohammad, et al.
Published: (2026)

Interpretable Neural-Symbolic Concept Reasoning
by: Barbiero, Pietro, et al.
Published: (2023)

Interpretable Deep Reinforcement Learning for Element-level Bridge Life-cycle Optimization
by: Moayyedi, Seyyed Amirhossein, et al.
Published: (2026)

Towards Interpretable Deep Reinforcement Learning Models via Inverse Reinforcement Learning
by: Xie, Sean, et al.
Published: (2022)

Surrogate Fitness Metrics for Interpretable Reinforcement Learning
by: Altmann, Philipp, et al.
Published: (2025)

Interpretable Concept-Based Memory Reasoning
by: Debot, David, et al.
Published: (2024)

Interpretability by Design for Efficient Multi-Objective Reinforcement Learning
by: Xia, Qiyue, et al.
Published: (2025)

Evaluating Interpretable Reinforcement Learning by Distilling Policies into Programs
by: Kohler, Hector, et al.
Published: (2025)

Interpretable and Editable Programmatic Tree Policies for Reinforcement Learning
by: Kohler, Hector, et al.
Published: (2024)

The Interpretability of Codebooks in Model-Based Reinforcement Learning is Limited
by: Eaton, Kenneth, et al.
Published: (2024)

Safety Modulation: Enhancing Safety in Reinforcement Learning through Cost-Modulated Rewards
by: Zhang, Hanping, et al.
Published: (2025)

LICORICE: Label-Efficient Concept-Based Interpretable Reinforcement Learning
by: Ye, Zhuorui, et al.
Published: (2024)

Towards Interpretable Reinforcement Learning with Constrained Normalizing Flow Policies
by: Rietz, Finn, et al.
Published: (2024)

Compositional Function Networks: A High-Performance Alternative to Deep Neural Networks with Built-in Interpretability
by: Li, Fang
Published: (2025)

Data-Centric Interpretability for LLM-based Multi-Agent Reinforcement Learning
by: Yan, John, et al.
Published: (2026)

Prism: Policy Reuse via Interpretable Strategy Mapping in Reinforcement Learning
by: Pravetz, Thomas
Published: (2026)

Three Pathways to Neurosymbolic Reinforcement Learning with Interpretable Model and Policy Networks
by: Graf, Peter, et al.
Published: (2024)

On Leakage in Machine Learning Pipelines
by: Sasse, Leonard, et al.
Published: (2023)

NSF-MAP: Neurosymbolic Multimodal Fusion for Robust and Interpretable Anomaly Prediction in Assembly Pipelines
by: Shyalika, Chathurangi, et al.
Published: (2025)

Neural Approaches to SAT Solving: Design Choices and Interpretability
by: Mojžíšek, David, et al.
Published: (2025)

A Reproducible Log-Driven AutoML Framework for Interpretable Pipeline Optimization in Healthcare Risk Prediction
by: Huang, Rui, et al.
Published: (2026)

Learning Interpretable Policies in Hindsight-Observable POMDPs through Partially Supervised Reinforcement Learning
by: Lanier, Michael, et al.
Published: (2024)

Metric Learning for Clifford Group Equivariant Neural Networks
by: Ali, Riccardo, et al.
Published: (2024)

Neural Lyapunov Function Approximation with Self-Supervised Reinforcement Learning
by: McCutcheon, Luc, et al.
Published: (2025)

Towards Mechanistic Interpretability of Graph Transformers via Attention Graphs
by: El, Batu, et al.
Published: (2025)

Distributed Multi-Agent Reinforcement Learning Based on Graph-Induced Local Value Functions
by: Jing, Gangshan, et al.
Published: (2022)

From Explainability to Interpretability: Interpretable Policies in Reinforcement Learning Via Model Explanation
by: Li, Peilang, et al.
Published: (2025)

Superposition in Graph Neural Networks
by: Pertl, Lukas, et al.
Published: (2025)

Revisiting Plasticity in Visual Reinforcement Learning: Data, Modules and Training Stages
by: Ma, Guozheng, et al.
Published: (2023)

Towards Automated Semantic Interpretability in Reinforcement Learning via Vision-Language Models
by: Li, Zhaoxin, et al.
Published: (2025)