| Main Authors: | Xiong, Zidi, Chen, Shan, Qi, Zhenting, Lakkaraju, Himabindu |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2505.13774 |
Similar Items
Monitorability as a Free Gift: How RLVR Spontaneously Aligns Reasoning
by: Xiong, Zidi, et al.
Published: (2026)
Generalizing Trust: Weak-to-Strong Trustworthiness in Language Models
by: Pawelczyk, Martin, et al.
Published: (2024)
Manipulating Large Language Models to Increase Product Visibility
by: Kumar, Aounon, et al.
Published: (2024)
D-PACE: Dynamic Position-Aware Cross-Entropy for Parallel Speculative Drafting
by: Wu, Tianyu, et al.
Published: (2026)
Follow My Instruction and Spill the Beans: Scalable Data Extraction from Retrieval-Augmented Generation Systems
by: Qi, Zhenting, et al.
Published: (2024)
How Memory Management Impacts LLM Agents: An Empirical Study of Experience-Following Behavior
by: Xiong, Zidi, et al.
Published: (2025)
MedSafetyBench: Evaluating and Improving the Medical Safety of Large Language Models
by: Han, Tessa, et al.
Published: (2024)
Learning Recourse Costs from Pairwise Feature Comparisons
by: Rawal, Kaivalya, et al.
Published: (2024)
In-Context Unlearning: Language Models as Few Shot Unlearners
by: Pawelczyk, Martin, et al.
Published: (2023)
EvoLM: In Search of Lost Language Model Training Dynamics
by: Qi, Zhenting, et al.
Published: (2025)
More RLHF, More Trust? On The Impact of Preference Alignment On Trustworthiness
by: Li, Aaron J., et al.
Published: (2024)
On the Hardness of Faithful Chain-of-Thought Reasoning in Large Language Models
by: Tanneru, Sree Harsha, et al.
Published: (2024)
Towards Unified Attribution in Explainable AI, Data-Centric AI, and Mechanistic Interpretability
by: Zhang, Shichang, et al.
Published: (2025)
Soft Best-of-n Sampling for Model Alignment
by: Verdun, Claudio Mayrink, et al.
Published: (2025)
On the Faithfulness of Visual Thinking: Measurement and Enhancement
by: Liu, Zujing, et al.
Published: (2025)
Operationalizing the Blueprint for an AI Bill of Rights: Recommendations for Practitioners, Researchers, and Policy Makers
by: Oesterling, Alex, et al.
Published: (2024)
Who Gets Credit or Blame? Attributing Accountability in Modern AI Systems
by: Zhang, Shichang, et al.
Published: (2025)
In-Context Explainers: Harnessing LLMs for Explaining Black Box Models
by: Kroeger, Nicholas, et al.
Published: (2023)
Evaluating Adversarial Robustness of Concept Representations in Sparse Autoencoders
by: Li, Aaron J., et al.
Published: (2025)
Detecting LLM-Generated Peer Reviews
by: Rao, Vishisht, et al.
Published: (2025)
Generalized Group Data Attribution
by: Ley, Dan, et al.
Published: (2024)
Faithfulness vs. Plausibility: On the (Un)Reliability of Explanations from Large Language Models
by: Agarwal, Chirag, et al.
Published: (2024)
Data Poisoning Attacks on Off-Policy Policy Evaluation Methods
by: Lobo, Elita, et al.
Published: (2024)
The Disagreement Problem in Explainable Machine Learning: A Practitioner's Perspective
by: Krishna, Satyapriya, et al.
Published: (2022)
DeFacto: Counterfactual Thinking with Images for Enforcing Evidence-Grounded and Faithful Reasoning
by: Xu, Tianrun, et al.
Published: (2025)
Draft-Thinking: Learning Efficient Reasoning in Long Chain-of-Thought LLMs
by: Cao, Jie, et al.
Published: (2026)
Temporal Sparse Autoencoders: Leveraging the Sequential Nature of Language for Interpretability
by: Bhalla, Usha, et al.
Published: (2025)
Computational Copyright: Towards A Royalty Model for Music Generative AI
by: Deng, Junwei, et al.
Published: (2023)
Reasoning on Graphs: Faithful and Interpretable Large Language Model Reasoning
by: Luo, Linhao, et al.
Published: (2023)
Think-on-Graph 2.0: Deep and Faithful Large Language Model Reasoning with Knowledge-guided Retrieval Augmented Generation
by: Ma, Shengjie, et al.
Published: (2024)
OpenHEXAI: An Open-Source Framework for Human-Centered Evaluation of Explainable Machine Learning
by: Ma, Jiaqi, et al.
Published: (2024)
Proof of Time: A Benchmark for Evaluating Scientific Idea Judgments
by: Ye, Bingyang, et al.
Published: (2026)
OpenXAI: Towards a Transparent Evaluation of Model Explanations
by: Agarwal, Chirag, et al.
Published: (2022)
Towards Interpretable Soft Prompts
by: Patel, Oam, et al.
Published: (2025)
Certifying LLM Safety against Adversarial Prompting
by: Kumar, Aounon, et al.
Published: (2023)
Adaptive Dual Reasoner: Large Reasoning Models Can Think Efficiently by Hybrid Reasoning
by: Zhang, Yujian, et al.
Published: (2025)
Don't Think Longer, Think Wisely: Optimizing Thinking Dynamics for Large Reasoning Models
by: An, Sohyun, et al.
Published: (2025)
Thinking, Faithful and Stable: Mitigating Hallucinations in LLMs
by: Zou, Chelsea, et al.
Published: (2025)
RFEval: Benchmarking Reasoning Faithfulness under Counterfactual Reasoning Intervention in Large Reasoning Models
by: Han, Yunseok, et al.
Published: (2026)
Quantifying Generalization Complexity for Large Language Models
by: Qi, Zhenting, et al.
Published: (2024)