:: Library Catalog

Bibliographic Details
Main Authors:	Agarwal, Chirag; Ley, Dan; Krishna, Satyapriya; Saxena, Eshika; Pawelczyk, Martin; Johnson, Nari; Puri, Isha; Zitnik, Marinka; Lakkaraju, Himabindu
Format:	Preprint
Published:	2022
Subjects:	Machine Learning; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2206.11104

Similar Items

On the Trade-offs between Adversarial Robustness and Actionable Explanations
by: Krishna, Satyapriya, et al.
Published: (2023)

In-Context Explainers: Harnessing LLMs for Explaining Black Box Models
by: Kroeger, Nicholas, et al.
Published: (2023)

Understanding the Effects of Iterative Prompting on Truthfulness
by: Krishna, Satyapriya, et al.
Published: (2024)

Explaining the Model, Protecting Your Data: Revealing and Mitigating the Data Privacy Risks of Post-Hoc Model Explanations via Membership Inference
by: Huang, Catherine, et al.
Published: (2024)

On the Hardness of Faithful Chain-of-Thought Reasoning in Large Language Models
by: Tanneru, Sree Harsha, et al.
Published: (2024)

Faithfulness vs. Plausibility: On the (Un)Reliability of Explanations from Large Language Models
by: Agarwal, Chirag, et al.
Published: (2024)

In-Context Unlearning: Language Models as Few Shot Unlearners
by: Pawelczyk, Martin, et al.
Published: (2023)

More RLHF, More Trust? On The Impact of Preference Alignment On Trustworthiness
by: Li, Aaron J., et al.
Published: (2024)

On the Impact of Fine-Tuning on Chain-of-Thought Reasoning
by: Lobo, Elita, et al.
Published: (2024)

MedSafetyBench: Evaluating and Improving the Medical Safety of Large Language Models
by: Han, Tessa, et al.
Published: (2024)

Generalizing Trust: Weak-to-Strong Trustworthiness in Language Models
by: Pawelczyk, Martin, et al.
Published: (2024)

The Disagreement Problem in Explainable Machine Learning: A Practitioner's Perspective
by: Krishna, Satyapriya, et al.
Published: (2022)

Generalized Group Data Attribution
by: Ley, Dan, et al.
Published: (2024)

Certifying LLM Safety against Adversarial Prompting
by: Kumar, Aounon, et al.
Published: (2023)

Towards Unifying Interpretability and Control: Evaluation via Intervention
by: Bhalla, Usha, et al.
Published: (2024)

Learning Recourse Costs from Pairwise Feature Comparisons
by: Rawal, Kaivalya, et al.
Published: (2024)

Manipulating Large Language Models to Increase Product Visibility
by: Kumar, Aounon, et al.
Published: (2024)

Prompting Decision Transformers for Zero-Shot Reach-Avoid Policies
by: Li, Kevin, et al.
Published: (2025)

Ken Utilization Layer: Hebbian Replay Within a Student's Ken for Adaptive Exercise Recommendation
by: Kuling, Grey, et al.
Published: (2025)

Towards Unified Attribution in Explainable AI, Data-Centric AI, and Mechanistic Interpretability
by: Zhang, Shichang, et al.
Published: (2025)

Characterizing Data Point Vulnerability via Average-Case Robustness
by: Han, Tessa, et al.
Published: (2023)

Monitorability as a Free Gift: How RLVR Spontaneously Aligns Reasoning
by: Xiong, Zidi, et al.
Published: (2026)

Discriminative Feature Attributions: Bridging Post Hoc Explainability and Inherent Interpretability
by: Bhalla, Usha, et al.
Published: (2023)

AutoScientists: Self-Organizing Agent Teams for Long-Running Scientific Experimentation
by: Gao, Shanghua, et al.
Published: (2026)

Generalized Protein Pocket Generation with Prior-Informed Flow Matching
by: Zhang, Zaixi, et al.
Published: (2024)

Evaluating Adversarial Robustness of Concept Representations in Sparse Autoencoders
by: Li, Aaron J., et al.
Published: (2025)

Holistic Explainable AI (H-XAI): Extending Transparency Beyond Developers in AI-Driven Decision Making
by: Lakkaraju, Kausik, et al.
Published: (2025)

Towards Uncovering How Large Language Model Works: An Explainability Perspective
by: Zhao, Haiyan, et al.
Published: (2024)

OpenHEXAI: An Open-Source Framework for Human-Centered Evaluation of Explainable Machine Learning
by: Ma, Jiaqi, et al.
Published: (2024)

Towards Interpretable Soft Prompts
by: Patel, Oam, et al.
Published: (2025)

PyTDC: A multimodal machine learning training, evaluation, and inference platform for biomedical foundation models
by: Velez-Arce, Alejandro, et al.
Published: (2025)

Graph Representation Learning in Biomedicine
by: Li, Michelle M., et al.
Published: (2021)

Qworld: Question-Specific Evaluation Criteria for LLMs
by: Gao, Shanghua, et al.
Published: (2026)

Data Poisoning Attacks on Off-Policy Policy Evaluation Methods
by: Lobo, Elita, et al.
Published: (2024)

Evaluating Relational Reasoning in LLMs with REL
by: Fesser, Lukas, et al.
Published: (2026)

Confronting LLMs with Traditional ML: Rethinking the Fairness of Large Language Models in Tabular Classifications
by: Liu, Yanchen, et al.
Published: (2023)

Measuring the Faithfulness of Thinking Drafts in Large Reasoning Models
by: Xiong, Zidi, et al.
Published: (2025)

Interpretability Needs a New Paradigm
by: Madsen, Andreas, et al.
Published: (2024)

Operationalizing the Blueprint for an AI Bill of Rights: Recommendations for Practitioners, Researchers, and Policy Makers
by: Oesterling, Alex, et al.
Published: (2024)

A System for Accurate Tracking and Video Recordings of Rodent Eye Movements using Convolutional Neural Networks for Biomedical Image Segmentation
by: Puri, Isha, et al.
Published: (2025)