| Main Authors: | Agarwal, Chirag, Ley, Dan, Krishna, Satyapriya, Saxena, Eshika, Pawelczyk, Martin, Johnson, Nari, Puri, Isha, Zitnik, Marinka, Lakkaraju, Himabindu |
|---|---|
| Format: | Preprint |
| Published: | 2022 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2206.11104 |
Similar Items
On the Trade-offs between Adversarial Robustness and Actionable Explanations
by: Krishna, Satyapriya, et al.
Published: (2023)
In-Context Explainers: Harnessing LLMs for Explaining Black Box Models
by: Kroeger, Nicholas, et al.
Published: (2023)
Understanding the Effects of Iterative Prompting on Truthfulness
by: Krishna, Satyapriya, et al.
Published: (2024)
Explaining the Model, Protecting Your Data: Revealing and Mitigating the Data Privacy Risks of Post-Hoc Model Explanations via Membership Inference
by: Huang, Catherine, et al.
Published: (2024)
On the Hardness of Faithful Chain-of-Thought Reasoning in Large Language Models
by: Tanneru, Sree Harsha, et al.
Published: (2024)
Faithfulness vs. Plausibility: On the (Un)Reliability of Explanations from Large Language Models
by: Agarwal, Chirag, et al.
Published: (2024)
In-Context Unlearning: Language Models as Few Shot Unlearners
by: Pawelczyk, Martin, et al.
Published: (2023)
More RLHF, More Trust? On The Impact of Preference Alignment On Trustworthiness
by: Li, Aaron J., et al.
Published: (2024)
On the Impact of Fine-Tuning on Chain-of-Thought Reasoning
by: Lobo, Elita, et al.
Published: (2024)
MedSafetyBench: Evaluating and Improving the Medical Safety of Large Language Models
by: Han, Tessa, et al.
Published: (2024)
Generalizing Trust: Weak-to-Strong Trustworthiness in Language Models
by: Pawelczyk, Martin, et al.
Published: (2024)
The Disagreement Problem in Explainable Machine Learning: A Practitioner's Perspective
by: Krishna, Satyapriya, et al.
Published: (2022)
Generalized Group Data Attribution
by: Ley, Dan, et al.
Published: (2024)
Certifying LLM Safety against Adversarial Prompting
by: Kumar, Aounon, et al.
Published: (2023)
Towards Unifying Interpretability and Control: Evaluation via Intervention
by: Bhalla, Usha, et al.
Published: (2024)
Learning Recourse Costs from Pairwise Feature Comparisons
by: Rawal, Kaivalya, et al.
Published: (2024)
Manipulating Large Language Models to Increase Product Visibility
by: Kumar, Aounon, et al.
Published: (2024)
Prompting Decision Transformers for Zero-Shot Reach-Avoid Policies
by: Li, Kevin, et al.
Published: (2025)
Ken Utilization Layer: Hebbian Replay Within a Student's Ken for Adaptive Exercise Recommendation
by: Kuling, Grey, et al.
Published: (2025)
Towards Unified Attribution in Explainable AI, Data-Centric AI, and Mechanistic Interpretability
by: Zhang, Shichang, et al.
Published: (2025)
Characterizing Data Point Vulnerability via Average-Case Robustness
by: Han, Tessa, et al.
Published: (2023)
Monitorability as a Free Gift: How RLVR Spontaneously Aligns Reasoning
by: Xiong, Zidi, et al.
Published: (2026)
Discriminative Feature Attributions: Bridging Post Hoc Explainability and Inherent Interpretability
by: Bhalla, Usha, et al.
Published: (2023)
AutoScientists: Self-Organizing Agent Teams for Long-Running Scientific Experimentation
by: Gao, Shanghua, et al.
Published: (2026)
Generalized Protein Pocket Generation with Prior-Informed Flow Matching
by: Zhang, Zaixi, et al.
Published: (2024)
Evaluating Adversarial Robustness of Concept Representations in Sparse Autoencoders
by: Li, Aaron J., et al.
Published: (2025)
Holistic Explainable AI (H-XAI): Extending Transparency Beyond Developers in AI-Driven Decision Making
by: Lakkaraju, Kausik, et al.
Published: (2025)
Towards Uncovering How Large Language Model Works: An Explainability Perspective
by: Zhao, Haiyan, et al.
Published: (2024)
OpenHEXAI: An Open-Source Framework for Human-Centered Evaluation of Explainable Machine Learning
by: Ma, Jiaqi, et al.
Published: (2024)
Towards Interpretable Soft Prompts
by: Patel, Oam, et al.
Published: (2025)
PyTDC: A multimodal machine learning training, evaluation, and inference platform for biomedical foundation models
by: Velez-Arce, Alejandro, et al.
Published: (2025)
Graph Representation Learning in Biomedicine
by: Li, Michelle M., et al.
Published: (2021)
Qworld: Question-Specific Evaluation Criteria for LLMs
by: Gao, Shanghua, et al.
Published: (2026)
Data Poisoning Attacks on Off-Policy Policy Evaluation Methods
by: Lobo, Elita, et al.
Published: (2024)
Evaluating Relational Reasoning in LLMs with REL
by: Fesser, Lukas, et al.
Published: (2026)
Confronting LLMs with Traditional ML: Rethinking the Fairness of Large Language Models in Tabular Classifications
by: Liu, Yanchen, et al.
Published: (2023)
Measuring the Faithfulness of Thinking Drafts in Large Reasoning Models
by: Xiong, Zidi, et al.
Published: (2025)
Interpretability Needs a New Paradigm
by: Madsen, Andreas, et al.
Published: (2024)
Operationalizing the Blueprint for an AI Bill of Rights: Recommendations for Practitioners, Researchers, and Policy Makers
by: Oesterling, Alex, et al.
Published: (2024)
A System for Accurate Tracking and Video Recordings of Rodent Eye Movements using Convolutional Neural Networks for Biomedical Image Segmentation
by: Puri, Isha, et al.
Published: (2025)