Library Catalog

Bibliographic Details
Main Authors:	Xiong, Zidi; Chen, Shan; Qi, Zhenting; Lakkaraju, Himabindu
Format:	Preprint
Published:	2025
Subjects:	Artificial Intelligence
Online Access:	https://arxiv.org/abs/2505.13774

Similar Items

Monitorability as a Free Gift: How RLVR Spontaneously Aligns Reasoning
by: Xiong, Zidi, et al.
Published: (2026)

Generalizing Trust: Weak-to-Strong Trustworthiness in Language Models
by: Pawelczyk, Martin, et al.
Published: (2024)

Manipulating Large Language Models to Increase Product Visibility
by: Kumar, Aounon, et al.
Published: (2024)

D-PACE: Dynamic Position-Aware Cross-Entropy for Parallel Speculative Drafting
by: Wu, Tianyu, et al.
Published: (2026)

Follow My Instruction and Spill the Beans: Scalable Data Extraction from Retrieval-Augmented Generation Systems
by: Qi, Zhenting, et al.
Published: (2024)

How Memory Management Impacts LLM Agents: An Empirical Study of Experience-Following Behavior
by: Xiong, Zidi, et al.
Published: (2025)

MedSafetyBench: Evaluating and Improving the Medical Safety of Large Language Models
by: Han, Tessa, et al.
Published: (2024)

Learning Recourse Costs from Pairwise Feature Comparisons
by: Rawal, Kaivalya, et al.
Published: (2024)

In-Context Unlearning: Language Models as Few Shot Unlearners
by: Pawelczyk, Martin, et al.
Published: (2023)

EvoLM: In Search of Lost Language Model Training Dynamics
by: Qi, Zhenting, et al.
Published: (2025)

More RLHF, More Trust? On The Impact of Preference Alignment On Trustworthiness
by: Li, Aaron J., et al.
Published: (2024)

On the Hardness of Faithful Chain-of-Thought Reasoning in Large Language Models
by: Tanneru, Sree Harsha, et al.
Published: (2024)

Towards Unified Attribution in Explainable AI, Data-Centric AI, and Mechanistic Interpretability
by: Zhang, Shichang, et al.
Published: (2025)

Soft Best-of-n Sampling for Model Alignment
by: Verdun, Claudio Mayrink, et al.
Published: (2025)

On the Faithfulness of Visual Thinking: Measurement and Enhancement
by: Liu, Zujing, et al.
Published: (2025)

Operationalizing the Blueprint for an AI Bill of Rights: Recommendations for Practitioners, Researchers, and Policy Makers
by: Oesterling, Alex, et al.
Published: (2024)

Who Gets Credit or Blame? Attributing Accountability in Modern AI Systems
by: Zhang, Shichang, et al.
Published: (2025)

In-Context Explainers: Harnessing LLMs for Explaining Black Box Models
by: Kroeger, Nicholas, et al.
Published: (2023)

Evaluating Adversarial Robustness of Concept Representations in Sparse Autoencoders
by: Li, Aaron J., et al.
Published: (2025)

Detecting LLM-Generated Peer Reviews
by: Rao, Vishisht, et al.
Published: (2025)

Generalized Group Data Attribution
by: Ley, Dan, et al.
Published: (2024)

Faithfulness vs. Plausibility: On the (Un)Reliability of Explanations from Large Language Models
by: Agarwal, Chirag, et al.
Published: (2024)

Data Poisoning Attacks on Off-Policy Policy Evaluation Methods
by: Lobo, Elita, et al.
Published: (2024)

The Disagreement Problem in Explainable Machine Learning: A Practitioner's Perspective
by: Krishna, Satyapriya, et al.
Published: (2022)

DeFacto: Counterfactual Thinking with Images for Enforcing Evidence-Grounded and Faithful Reasoning
by: Xu, Tianrun, et al.
Published: (2025)

Draft-Thinking: Learning Efficient Reasoning in Long Chain-of-Thought LLMs
by: Cao, Jie, et al.
Published: (2026)

Temporal Sparse Autoencoders: Leveraging the Sequential Nature of Language for Interpretability
by: Bhalla, Usha, et al.
Published: (2025)

Computational Copyright: Towards A Royalty Model for Music Generative AI
by: Deng, Junwei, et al.
Published: (2023)

Reasoning on Graphs: Faithful and Interpretable Large Language Model Reasoning
by: Luo, Linhao, et al.
Published: (2023)

Think-on-Graph 2.0: Deep and Faithful Large Language Model Reasoning with Knowledge-guided Retrieval Augmented Generation
by: Ma, Shengjie, et al.
Published: (2024)

OpenHEXAI: An Open-Source Framework for Human-Centered Evaluation of Explainable Machine Learning
by: Ma, Jiaqi, et al.
Published: (2024)

Proof of Time: A Benchmark for Evaluating Scientific Idea Judgments
by: Ye, Bingyang, et al.
Published: (2026)

OpenXAI: Towards a Transparent Evaluation of Model Explanations
by: Agarwal, Chirag, et al.
Published: (2022)

Towards Interpretable Soft Prompts
by: Patel, Oam, et al.
Published: (2025)

Certifying LLM Safety against Adversarial Prompting
by: Kumar, Aounon, et al.
Published: (2023)

Adaptive Dual Reasoner: Large Reasoning Models Can Think Efficiently by Hybrid Reasoning
by: Zhang, Yujian, et al.
Published: (2025)

Don't Think Longer, Think Wisely: Optimizing Thinking Dynamics for Large Reasoning Models
by: An, Sohyun, et al.
Published: (2025)

Thinking, Faithful and Stable: Mitigating Hallucinations in LLMs
by: Zou, Chelsea, et al.
Published: (2025)

RFEval: Benchmarking Reasoning Faithfulness under Counterfactual Reasoning Intervention in Large Reasoning Models
by: Han, Yunseok, et al.
Published: (2026)

Quantifying Generalization Complexity for Large Language Models
by: Qi, Zhenting, et al.
Published: (2024)