| Main Authors: | Krishna, Satyapriya; Agarwal, Chirag; Lakkaraju, Himabindu |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2402.06625 |
Similar Items
In-Context Explainers: Harnessing LLMs for Explaining Black Box Models
by: Kroeger, Nicholas, et al.
Published: (2023)
On the Trade-offs between Adversarial Robustness and Actionable Explanations
by: Krishna, Satyapriya, et al.
Published: (2023)
More RLHF, More Trust? On The Impact of Preference Alignment On Trustworthiness
by: Li, Aaron J., et al.
Published: (2024)
On the Impact of Fine-Tuning on Chain-of-Thought Reasoning
by: Lobo, Elita, et al.
Published: (2024)
Faithfulness vs. Plausibility: On the (Un)Reliability of Explanations from Large Language Models
by: Agarwal, Chirag, et al.
Published: (2024)
On the Hardness of Faithful Chain-of-Thought Reasoning in Large Language Models
by: Tanneru, Sree Harsha, et al.
Published: (2024)
Certifying LLM Safety against Adversarial Prompting
by: Kumar, Aounon, et al.
Published: (2023)
Manipulating Large Language Models to Increase Product Visibility
by: Kumar, Aounon, et al.
Published: (2024)
Confronting LLMs with Traditional ML: Rethinking the Fairness of Large Language Models in Tabular Classifications
by: Liu, Yanchen, et al.
Published: (2023)
Towards Interpretable Soft Prompts
by: Patel, Oam, et al.
Published: (2025)
Interpretability Needs a New Paradigm
by: Madsen, Andreas, et al.
Published: (2024)
How Post-Training Reshapes LLMs: A Mechanistic View on Knowledge, Truthfulness, Refusal, and Confidence
by: Du, Hongzhe, et al.
Published: (2025)
MedSafetyBench: Evaluating and Improving the Medical Safety of Large Language Models
by: Han, Tessa, et al.
Published: (2024)
OpenXAI: Towards a Transparent Evaluation of Model Explanations
by: Agarwal, Chirag, et al.
Published: (2022)
Towards Uncovering How Large Language Model Works: An Explainability Perspective
by: Zhao, Haiyan, et al.
Published: (2024)
Evaluating Adversarial Robustness of Concept Representations in Sparse Autoencoders
by: Li, Aaron J., et al.
Published: (2025)
Learning from Failures: Understanding LLM Alignment through Failure-Aware Inverse RL
by: Patel, Nyal, et al.
Published: (2025)
The Disagreement Problem in Explainable Machine Learning: A Practitioner's Perspective
by: Krishna, Satyapriya, et al.
Published: (2022)
Follow My Instruction and Spill the Beans: Scalable Data Extraction from Retrieval-Augmented Generation Systems
by: Qi, Zhenting, et al.
Published: (2024)
Learning Recourse Costs from Pairwise Feature Comparisons
by: Rawal, Kaivalya, et al.
Published: (2024)
Towards Understanding the Robustness of Sparse Autoencoders
by: Saiyed, Ahson, et al.
Published: (2026)
Temporal Sparse Autoencoders: Leveraging the Sequential Nature of Language for Interpretability
by: Bhalla, Usha, et al.
Published: (2025)
Self-Improving Language Models with Bidirectional Evolutionary Search
by: Xu, Guowei, et al.
Published: (2026)
Quantifying Generalization Complexity for Large Language Models
by: Qi, Zhenting, et al.
Published: (2024)
Fact, Fetch, and Reason: A Unified Evaluation of Retrieval-Augmented Generation
by: Krishna, Satyapriya, et al.
Published: (2024)
The Alignment Auditor: A Bayesian Framework for Verifying and Refining LLM Objectives
by: Bou, Matthieu, et al.
Published: (2025)
Toward Understanding Unlearning Difficulty: A Mechanistic Perspective and Circuit-Guided Difficulty Metric
by: Cheng, Jiali, et al.
Published: (2026)
Self-Correcting Large Language Models: Generation vs. Multiple Choice
by: Rahmani, Hossein A., et al.
Published: (2025)
Discriminative Feature Attributions: Bridging Post Hoc Explainability and Inherent Interpretability
by: Bhalla, Usha, et al.
Published: (2023)
Polarity-Aware Probing for Quantifying Latent Alignment in Language Models
by: Sadiekh, Sabrina, et al.
Published: (2025)
A Study on the Calibration of In-context Learning
by: Zhang, Hanlin, et al.
Published: (2023)
Insights from the Inverse: Reconstructing LLM Training Goals Through Inverse Reinforcement Learning
by: Joselowitz, Jared, et al.
Published: (2024)
A Survey of Multilingual Reasoning in Language Models
by: Ghosh, Akash, et al.
Published: (2025)
Towards Operationalizing Right to Data Protection
by: Java, Abhinav, et al.
Published: (2024)
EvoLM: In Search of Lost Language Model Training Dynamics
by: Qi, Zhenting, et al.
Published: (2025)
Do Students Debias Like Teachers? On the Distillability of Bias Mitigation Methods
by: Cheng, Jiali, et al.
Published: (2025)
Understanding Before Reasoning: Enhancing Chain-of-Thought with Iterative Summarization Pre-Prompting
by: Zhu, Dong-Hai, et al.
Published: (2025)
SynopticBench: Evaluating Vision-Language Models on Generating Weather Forecast Discussions of the Future
by: Higgins, Timothy B., et al.
Published: (2026)
Operationalizing the Blueprint for an AI Bill of Rights: Recommendations for Practitioners, Researchers, and Policy Makers
by: Oesterling, Alex, et al.
Published: (2024)
The Hard Positive Truth about Vision-Language Compositionality
by: Kamath, Amita, et al.
Published: (2024)