Bibliographic Details
Main Authors:	Oesterling, Alex; Ren, Donghao; Assogba, Yannick; Moritz, Dominik; Kim, Sunnie S. Y.; Gatys, Leon; Hohman, Fred
Format:	Preprint
Published:	2026
Subjects:	Artificial Intelligence, Machine Learning
Online Access:	https://arxiv.org/abs/2605.05329

Similar Items

Compress and Compare: Interactively Evaluating Efficiency and Behavior Across ML Model Compression Experiments
by: Boggust, Angie, et al.
Published: (2024)

Evaluating Long Range Dependency Handling in Code Generation LLMs
by: Assogba, Yannick, et al.
Published: (2024)

Exploring Empty Spaces: Human-in-the-Loop Data Augmentation
by: Yeh, Catherine, et al.
Published: (2024)

Model Compression in Practice: Lessons Learned from Practitioners Creating On-device Machine Learning Experiences
by: Hohman, Fred, et al.
Published: (2023)

A Scalable Approach to Clustering Embedding Projections
by: Ren, Donghao, et al.
Published: (2025)

Semantic Regexes: Auto-Interpreting LLM Features with a Structured Language
by: Boggust, Angie, et al.
Published: (2025)

Disentangled Safety Adapters Enable Efficient Guardrails and Flexible Inference-Time Alignment
by: Krishna, Kundan, et al.
Published: (2025)

Embedding Atlas: Low-Friction, Interactive Embedding Visualization
by: Ren, Donghao, et al.
Published: (2025)

Operationalizing the Blueprint for an AI Bill of Rights: Recommendations for Practitioners, Researchers, and Policy Makers
by: Oesterling, Alex, et al.
Published: (2024)

Policy Maps: Tools for Guiding the Unbounded Space of LLM Behaviors
by: Lam, Michelle S., et al.
Published: (2024)

Temporal Sparse Autoencoders: Leveraging the Sequential Nature of Language for Interpretability
by: Bhalla, Usha, et al.
Published: (2025)

Fair Machine Unlearning: Data Removal while Mitigating Disparities
by: Oesterling, Alex, et al.
Published: (2023)

VLSU: Mapping the Limits of Joint Multimodal Understanding for AI Safety
by: Palaskar, Shruti, et al.
Published: (2025)

Talaria: Interactively Optimizing Machine Learning Models for Efficient Inference
by: Hohman, Fred, et al.
Published: (2024)

Constructive Circuit Amplification: Improving Math Reasoning in LLMs via Targeted Sub-Network Updates
by: Prakash, Nikhil, et al.
Published: (2025)

EncQA: Benchmarking Vision-Language Models on Visual Encodings for Charts
by: Mukherjee, Kushin, et al.
Published: (2025)

PersonaTeaming: Exploring How Introducing Personas Can Improve Automated AI Red-Teaming
by: Deng, Wesley Hanwen, et al.
Published: (2025)

PersonaTeaming: Supporting Persona-Driven Red-Teaming for Generative AI
by: Deng, Wesley Hanwen, et al.
Published: (2026)

Reinforcement Learning with $ω$-Regular Objectives and Constraints
by: Wagner, Dominik, et al.
Published: (2025)

Interpreting and Controlling LLM Reasoning through Integrated Policy Gradient
by: Li, Changming, et al.
Published: (2026)

APRIL: Annotations for Policy evaluation with Reliable Inference from LLMs
by: Mandyam, Aishwarya, et al.
Published: (2025)

Understanding and Preserving Safety in Fine-Tuned LLMs
by: Zhang, Jiawen, et al.
Published: (2026)

Safe Transformer: An Explicit Safety Bit For Interpretable And Controllable Alignment
by: Feng, Jingyuan, et al.
Published: (2026)

Counting Still Counts: Understanding Neural Complex Query Answering Through Query Relaxation
by: Brunink, Yannick, et al.
Published: (2025)

Interpretable Policy Distillation for Power Grid Topology Control
by: Dmitruka, Aleksandra, et al.
Published: (2026)

Interpretable and Editable Programmatic Tree Policies for Reinforcement Learning
by: Kohler, Hector, et al.
Published: (2024)

Pragmatic Policy Development via Interpretable Behavior Cloning
by: Matsson, Anton, et al.
Published: (2025)

Evaluating Interpretable Reinforcement Learning by Distilling Policies into Programs
by: Kohler, Hector, et al.
Published: (2025)

Multi-Group Proportional Representation in Retrieval
by: Oesterling, Alex, et al.
Published: (2024)

Annotation-Assisted Learning of Treatment Policies From Multimodal Electronic Health Records
by: Arno, Henri, et al.
Published: (2025)

An Analysis of Concept Bottleneck Models: Measuring, Understanding, and Mitigating the Impact of Noisy Annotations
by: Park, Seonghwan, et al.
Published: (2025)

Towards Interpretable Reinforcement Learning with Constrained Normalizing Flow Policies
by: Rietz, Finn, et al.
Published: (2024)

Machine Learning for Climate Policy: Understanding Policy Progression in the European Green Deal
by: West, Patricia, et al.
Published: (2025)

Efficient Multi-objective Prompt Optimization via Pure-exploration Bandits
by: Li, Donghao, et al.
Published: (2026)

A Shared Low-Rank Adaptation Approach to Personalized RLHF
by: Liu, Renpu, et al.
Published: (2025)

Interpretable Predictability-Based AI Text Detection: A Replication Study
by: Skurla, Adam, et al.
Published: (2026)

Identifying Intervenable and Interpretable Features via Orthogonality Regularization
by: Miller, Moritz, et al.
Published: (2026)

Prism: Policy Reuse via Interpretable Strategy Mapping in Reinforcement Learning
by: Pravetz, Thomas
Published: (2026)

Three Pathways to Neurosymbolic Reinforcement Learning with Interpretable Model and Policy Networks
by: Graf, Peter, et al.
Published: (2024)

How Ensembles of Distilled Policies Improve Generalisation in Reinforcement Learning
by: Weltevrede, Max, et al.
Published: (2025)