:: Library Catalog

Bibliographic Details
Main Authors:	Mayne, Harry; Kang, Justin Singh; Gould, Dewi; Ramchandran, Kannan; Mahdi, Adam; Siegel, Noah Y.
Format:	Preprint
Published:	2026
Subjects:	Artificial Intelligence; Machine Learning
Online Access:	https://arxiv.org/abs/2602.02639

Similar Items

Verbosity Tradeoffs and the Impact of Scale on the Faithfulness of LLM Self-Explanations
by: Siegel, Noah Y., et al.
Published: (2025)

The Probabilities Also Matter: A More Faithful Metric for Faithfulness of Free-Text Explanations in Large Language Models
by: Siegel, Noah Y., et al.
Published: (2024)

SPEX: Scaling Feature Interaction Explanations for LLMs
by: Kang, Justin Singh, et al.
Published: (2025)

Quantifying Positional Biases in Text Embedding Models
by: Lee, Reagan J., et al.
Published: (2024)

Can sparse autoencoders be used to decompose and interpret steering vectors?
by: Mayne, Harry, et al.
Published: (2024)

An Odd Estimator for Shapley Values
by: Fumagalli, Fabian, et al.
Published: (2026)

LLMs Don't Know Their Own Decision Boundaries: The Unreliability of Self-Generated Counterfactual Explanations
by: Mayne, Harry, et al.
Published: (2025)

The Fair Value of Data Under Heterogeneous Privacy Constraints in Federated Learning
by: Kang, Justin, et al.
Published: (2023)

ProxySPEX: Inference-Efficient Interpretability via Sparse Feature Interactions in LLMs
by: Butler, Landon, et al.
Published: (2025)

SAGE: A Realistic Benchmark for Semantic Understanding
by: Goel, Samarth, et al.
Published: (2025)

EmbedLLM: Learning Compact Representations of Large Language Models
by: Zhuang, Richard, et al.
Published: (2024)

Adaptive Sparse Möbius Transforms for Learning Polynomials
by: Erginbas, Yigit Efe, et al.
Published: (2026)

Unsupervised Learning Approaches for Identifying ICU Patient Subgroups: Do Results Generalise?
by: Mayne, Harry, et al.
Published: (2024)

Local Explanations and Self-Explanations for Assessing Faithfulness in black-box LLMs
by: Fragkathoulas, Christos, et al.
Published: (2024)

Small Language Model Helps Resolve Semantic Ambiguity of LLM Prompt
by: Huang, Zhenzhen, et al.
Published: (2026)

FaithLM: Towards Faithful Explanations for Large Language Models
by: Chuang, Yu-Neng, et al.
Published: (2024)

Towards Anytime-Valid Statistical Watermarking
by: Huang, Baihe, et al.
Published: (2026)

Faithfulness Serum: Mitigating the Faithfulness Gap in Textual Explanations of LLM Decisions via Attribution Guidance
by: Alon, Bar, et al.
Published: (2026)

Faithful and Robust LLM-Driven Theorem Proving for NLI Explanations
by: Quan, Xin, et al.
Published: (2025)

The Effect of Model Size on LLM Post-hoc Explainability via LIME
by: Heyen, Henning, et al.
Published: (2024)

Learning to Understand: Identifying Interactions via the Möbius Transform
by: Kang, Justin S., et al.
Published: (2024)

SKATE, a Scalable Tournament Eval: Weaker LLMs differentiate between stronger ones using verifiable challenges
by: Gould, Dewi S. W., et al.
Published: (2025)

Explanation-Driven Counterfactual Testing for Faithfulness in Vision-Language Model Explanations
by: Ding, Sihao, et al.
Published: (2025)

Faithful and Plausible Natural Language Explanations for Image Classification: A Pipeline Approach
by: Wojciechowski, Adam, et al.
Published: (2024)

Large language models can help boost food production, but be mindful of their risks
by: De Clercq, Djavan, et al.
Published: (2024)

On Measuring Faithfulness or Self-consistency of Natural Language Explanations
by: Parcalabescu, Letitia, et al.
Published: (2023)

Neuro-Argumentative Learning with Case-Based Reasoning
by: Gould, Adam, et al.
Published: (2025)

LLMs Can Covertly Sandbag on Capability Evaluations Against Chain-of-Thought Monitoring
by: Li, Chloe, et al.
Published: (2025)

Illocutionary Explanation Planning for Source-Faithful Explanations in Retrieval-Augmented Language Models
by: Sovrano, Francesco, et al.
Published: (2026)

DeepFaith: A Domain-Free and Model-Agnostic Unified Framework for Highly Faithful Explanations
by: Guo, Yuhan, et al.
Published: (2025)

AirTrafficGen: Configurable Air Traffic Scenario Generation with Large Language Models
by: Gould, Dewi Sid William, et al.
Published: (2025)

How Does DPO Reduce Toxicity? A Mechanistic Neuron-Level Analysis
by: Yang, Yushi, et al.
Published: (2024)

A framework for assuring the accuracy and fidelity of an AI-enabled Digital Twin of en route UK airspace
by: Keane, Adam, et al.
Published: (2026)

Comparables XAI: Faithful Example-based AI Explanations with Counterfactual Trace Adjustments
by: Zhang, Yifan, et al.
Published: (2026)

Faithfulness and the Notion of Adversarial Sensitivity in NLP Explanations
by: Manna, Supriya, et al.
Published: (2024)

Walk the Talk? Measuring the Faithfulness of Large Language Model Explanations
by: Matton, Katie, et al.
Published: (2025)

LINGOLY-TOO: Disentangling Reasoning from Knowledge with Templatised Orthographic Obfuscation
by: Khouja, Jude, et al.
Published: (2025)

Toward a Theory of Tokenization in LLMs
by: Rajaraman, Nived, et al.
Published: (2024)

Negation Neglect: When models fail to learn negations in training
by: Mayne, Harry, et al.
Published: (2026)

Evaluating Readability and Faithfulness of Concept-based Explanations
by: Li, Meng, et al.
Published: (2024)