:: Library Catalog

Bibliographic Details
Main Authors:	Krishna, Satyapriya, Agarwal, Chirag, Lakkaraju, Himabindu
Format:	Preprint
Published:	2024
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2402.06625

Similar Items

In-Context Explainers: Harnessing LLMs for Explaining Black Box Models
by: Kroeger, Nicholas, et al.
Published: (2023)

On the Trade-offs between Adversarial Robustness and Actionable Explanations
by: Krishna, Satyapriya, et al.
Published: (2023)

More RLHF, More Trust? On The Impact of Preference Alignment On Trustworthiness
by: Li, Aaron J., et al.
Published: (2024)

On the Impact of Fine-Tuning on Chain-of-Thought Reasoning
by: Lobo, Elita, et al.
Published: (2024)

Faithfulness vs. Plausibility: On the (Un)Reliability of Explanations from Large Language Models
by: Agarwal, Chirag, et al.
Published: (2024)

On the Hardness of Faithful Chain-of-Thought Reasoning in Large Language Models
by: Tanneru, Sree Harsha, et al.
Published: (2024)

Certifying LLM Safety against Adversarial Prompting
by: Kumar, Aounon, et al.
Published: (2023)

Manipulating Large Language Models to Increase Product Visibility
by: Kumar, Aounon, et al.
Published: (2024)

Confronting LLMs with Traditional ML: Rethinking the Fairness of Large Language Models in Tabular Classifications
by: Liu, Yanchen, et al.
Published: (2023)

Towards Interpretable Soft Prompts
by: Patel, Oam, et al.
Published: (2025)

Interpretability Needs a New Paradigm
by: Madsen, Andreas, et al.
Published: (2024)

How Post-Training Reshapes LLMs: A Mechanistic View on Knowledge, Truthfulness, Refusal, and Confidence
by: Du, Hongzhe, et al.
Published: (2025)

MedSafetyBench: Evaluating and Improving the Medical Safety of Large Language Models
by: Han, Tessa, et al.
Published: (2024)

OpenXAI: Towards a Transparent Evaluation of Model Explanations
by: Agarwal, Chirag, et al.
Published: (2022)

Towards Uncovering How Large Language Model Works: An Explainability Perspective
by: Zhao, Haiyan, et al.
Published: (2024)

Evaluating Adversarial Robustness of Concept Representations in Sparse Autoencoders
by: Li, Aaron J., et al.
Published: (2025)

Learning from Failures: Understanding LLM Alignment through Failure-Aware Inverse RL
by: Patel, Nyal, et al.
Published: (2025)

The Disagreement Problem in Explainable Machine Learning: A Practitioner's Perspective
by: Krishna, Satyapriya, et al.
Published: (2022)

Follow My Instruction and Spill the Beans: Scalable Data Extraction from Retrieval-Augmented Generation Systems
by: Qi, Zhenting, et al.
Published: (2024)

Learning Recourse Costs from Pairwise Feature Comparisons
by: Rawal, Kaivalya, et al.
Published: (2024)

Towards Understanding the Robustness of Sparse Autoencoders
by: Saiyed, Ahson, et al.
Published: (2026)

Temporal Sparse Autoencoders: Leveraging the Sequential Nature of Language for Interpretability
by: Bhalla, Usha, et al.
Published: (2025)

Self-Improving Language Models with Bidirectional Evolutionary Search
by: Xu, Guowei, et al.
Published: (2026)

Quantifying Generalization Complexity for Large Language Models
by: Qi, Zhenting, et al.
Published: (2024)

Fact, Fetch, and Reason: A Unified Evaluation of Retrieval-Augmented Generation
by: Krishna, Satyapriya, et al.
Published: (2024)

The Alignment Auditor: A Bayesian Framework for Verifying and Refining LLM Objectives
by: Bou, Matthieu, et al.
Published: (2025)

Toward Understanding Unlearning Difficulty: A Mechanistic Perspective and Circuit-Guided Difficulty Metric
by: Cheng, Jiali, et al.
Published: (2026)

Self-Correcting Large Language Models: Generation vs. Multiple Choice
by: Rahmani, Hossein A., et al.
Published: (2025)

Discriminative Feature Attributions: Bridging Post Hoc Explainability and Inherent Interpretability
by: Bhalla, Usha, et al.
Published: (2023)

Polarity-Aware Probing for Quantifying Latent Alignment in Language Models
by: Sadiekh, Sabrina, et al.
Published: (2025)

A Study on the Calibration of In-context Learning
by: Zhang, Hanlin, et al.
Published: (2023)

Insights from the Inverse: Reconstructing LLM Training Goals Through Inverse Reinforcement Learning
by: Joselowitz, Jared, et al.
Published: (2024)

A Survey of Multilingual Reasoning in Language Models
by: Ghosh, Akash, et al.
Published: (2025)

Towards Operationalizing Right to Data Protection
by: Java, Abhinav, et al.
Published: (2024)

EvoLM: In Search of Lost Language Model Training Dynamics
by: Qi, Zhenting, et al.
Published: (2025)

Do Students Debias Like Teachers? On the Distillability of Bias Mitigation Methods
by: Cheng, Jiali, et al.
Published: (2025)

Understanding Before Reasoning: Enhancing Chain-of-Thought with Iterative Summarization Pre-Prompting
by: Zhu, Dong-Hai, et al.
Published: (2025)

SynopticBench: Evaluating Vision-Language Models on Generating Weather Forecast Discussions of the Future
by: Higgins, Timothy B., et al.
Published: (2026)

Operationalizing the Blueprint for an AI Bill of Rights: Recommendations for Practitioners, Researchers, and Policy Makers
by: Oesterling, Alex, et al.
Published: (2024)

The Hard Positive Truth about Vision-Language Compositionality
by: Kamath, Amita, et al.
Published: (2024)