:: Library Catalog

Bibliographic Details
Main Authors:	Xiong, Zidi; Lin, Yuping; Xie, Wenya; He, Pengfei; Liu, Zirui; Tang, Jiliang; Lakkaraju, Himabindu; Xiang, Zhen
Format:	Preprint
Published:	2025
Subjects:	Artificial Intelligence
Online Access:	https://arxiv.org/abs/2505.16067

Similar Items

Monitorability as a Free Gift: How RLVR Spontaneously Aligns Reasoning
by: Xiong, Zidi, et al.
Published: (2026)

Measuring the Faithfulness of Thinking Drafts in Large Reasoning Models
by: Xiong, Zidi, et al.
Published: (2025)

Unveiling Privacy Risks in LLM Agent Memory
by: Wang, Bo, et al.
Published: (2025)

Memory Injection Attacks on LLM Agents via Query-Only Interaction
by: Dong, Shen, et al.
Published: (2025)

On the Impact of Fine-Tuning on Chain-of-Thought Reasoning
by: Lobo, Elita, et al.
Published: (2024)

Learning Recourse Costs from Pairwise Feature Comparisons
by: Rawal, Kaivalya, et al.
Published: (2024)

Manipulating Large Language Models to Increase Product Visibility
by: Kumar, Aounon, et al.
Published: (2024)

More RLHF, More Trust? On The Impact of Preference Alignment On Trustworthiness
by: Li, Aaron J., et al.
Published: (2024)

Multi-Faceted Studies on Data Poisoning can Advance LLM Development
by: He, Pengfei, et al.
Published: (2025)

Detecting LLM-Generated Peer Reviews
by: Rao, Vishisht, et al.
Published: (2025)

In-Context Unlearning: Language Models as Few Shot Unlearners
by: Pawelczyk, Martin, et al.
Published: (2023)

Understanding the Effects of Iterative Prompting on Truthfulness
by: Krishna, Satyapriya, et al.
Published: (2024)

Characterizing Data Point Vulnerability via Average-Case Robustness
by: Han, Tessa, et al.
Published: (2023)

Discriminative Feature Attributions: Bridging Post Hoc Explainability and Inherent Interpretability
by: Bhalla, Usha, et al.
Published: (2023)

On the Trade-offs between Adversarial Robustness and Actionable Explanations
by: Krishna, Satyapriya, et al.
Published: (2023)

Explaining the Model, Protecting Your Data: Revealing and Mitigating the Data Privacy Risks of Post-Hoc Model Explanations via Membership Inference
by: Huang, Catherine, et al.
Published: (2024)

Follow My Instruction and Spill the Beans: Scalable Data Extraction from Retrieval-Augmented Generation Systems
by: Qi, Zhenting, et al.
Published: (2024)

CBD: A Certified Backdoor Detector Based on Local Dominant Probability
by: Xiang, Zhen, et al.
Published: (2023)

When Continual Learning Moves to Memory: A Study of Experience Reuse in LLM Agents
by: Hu, Qisheng, et al.
Published: (2026)

Towards Uncovering How Large Language Model Works: An Explainability Perspective
by: Zhao, Haiyan, et al.
Published: (2024)

Faithfulness vs. Plausibility: On the (Un)Reliability of Explanations from Large Language Models
by: Agarwal, Chirag, et al.
Published: (2024)

A Simple Plug-in for Improving Eviction-Based KV Cache Compression
by: Lin, Yuping, et al.
Published: (2026)

Towards Unified Attribution in Explainable AI, Data-Centric AI, and Mechanistic Interpretability
by: Zhang, Shichang, et al.
Published: (2025)

Confronting LLMs with Traditional ML: Rethinking the Fairness of Large Language Models in Tabular Classifications
by: Liu, Yanchen, et al.
Published: (2023)

Towards Unifying Interpretability and Control: Evaluation via Intervention
by: Bhalla, Usha, et al.
Published: (2024)

Interpretability Needs a New Paradigm
by: Madsen, Andreas, et al.
Published: (2024)

MedSafetyBench: Evaluating and Improving the Medical Safety of Large Language Models
by: Han, Tessa, et al.
Published: (2024)

Operationalizing the Blueprint for an AI Bill of Rights: Recommendations for Practitioners, Researchers, and Policy Makers
by: Oesterling, Alex, et al.
Published: (2024)

GuardAgent: Safeguard LLM Agents by a Guard Agent via Knowledge-Enabled Reasoning
by: Xiang, Zhen, et al.
Published: (2024)

Certifying LLM Safety against Adversarial Prompting
by: Kumar, Aounon, et al.
Published: (2023)

Towards Understanding Jailbreak Attacks in LLMs: A Representation Space Analysis
by: Lin, Yuping, et al.
Published: (2024)

To trust or not to trust: Attention-based Trust Management for LLM Multi-Agent Systems
by: He, Pengfei, et al.
Published: (2025)

Who Gets Credit or Blame? Attributing Accountability in Modern AI Systems
by: Zhang, Shichang, et al.
Published: (2025)

On the Hardness of Faithful Chain-of-Thought Reasoning in Large Language Models
by: Tanneru, Sree Harsha, et al.
Published: (2024)

Evaluating Adversarial Robustness of Concept Representations in Sparse Autoencoders
by: Li, Aaron J., et al.
Published: (2025)

Crafting Reversible SFT Behaviors in Large Language Models
by: Lin, Yuping, et al.
Published: (2026)

Soft Best-of-n Sampling for Model Alignment
by: Verdun, Claudio Mayrink, et al.
Published: (2025)

Data Poisoning Attacks on Off-Policy Policy Evaluation Methods
by: Lobo, Elita, et al.
Published: (2024)

Generalized Group Data Attribution
by: Ley, Dan, et al.
Published: (2024)

Generalizing Trust: Weak-to-Strong Trustworthiness in Language Models
by: Pawelczyk, Martin, et al.
Published: (2024)