| Main authors: | Mayne, Harry; Kang, Justin Singh; Gould, Dewi; Ramchandran, Kannan; Mahdi, Adam; Siegel, Noah Y. |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online access: | https://arxiv.org/abs/2602.02639 |
Similar items
Verbosity Tradeoffs and the Impact of Scale on the Faithfulness of LLM Self-Explanations
by: Siegel, Noah Y., et al.
Published: (2025)
The Probabilities Also Matter: A More Faithful Metric for Faithfulness of Free-Text Explanations in Large Language Models
by: Siegel, Noah Y., et al.
Published: (2024)
SPEX: Scaling Feature Interaction Explanations for LLMs
by: Kang, Justin Singh, et al.
Published: (2025)
Quantifying Positional Biases in Text Embedding Models
by: Lee, Reagan J., et al.
Published: (2024)
Can sparse autoencoders be used to decompose and interpret steering vectors?
by: Mayne, Harry, et al.
Published: (2024)
An Odd Estimator for Shapley Values
by: Fumagalli, Fabian, et al.
Published: (2026)
LLMs Don't Know Their Own Decision Boundaries: The Unreliability of Self-Generated Counterfactual Explanations
by: Mayne, Harry, et al.
Published: (2025)
The Fair Value of Data Under Heterogeneous Privacy Constraints in Federated Learning
by: Kang, Justin, et al.
Published: (2023)
ProxySPEX: Inference-Efficient Interpretability via Sparse Feature Interactions in LLMs
by: Butler, Landon, et al.
Published: (2025)
SAGE: A Realistic Benchmark for Semantic Understanding
by: Goel, Samarth, et al.
Published: (2025)
EmbedLLM: Learning Compact Representations of Large Language Models
by: Zhuang, Richard, et al.
Published: (2024)
Adaptive Sparse Möbius Transforms for Learning Polynomials
by: Erginbas, Yigit Efe, et al.
Published: (2026)
Unsupervised Learning Approaches for Identifying ICU Patient Subgroups: Do Results Generalise?
by: Mayne, Harry, et al.
Published: (2024)
Local Explanations and Self-Explanations for Assessing Faithfulness in black-box LLMs
by: Fragkathoulas, Christos, et al.
Published: (2024)
Small Language Model Helps Resolve Semantic Ambiguity of LLM Prompt
by: Huang, Zhenzhen, et al.
Published: (2026)
FaithLM: Towards Faithful Explanations for Large Language Models
by: Chuang, Yu-Neng, et al.
Published: (2024)
Towards Anytime-Valid Statistical Watermarking
by: Huang, Baihe, et al.
Published: (2026)
Faithfulness Serum: Mitigating the Faithfulness Gap in Textual Explanations of LLM Decisions via Attribution Guidance
by: Alon, Bar, et al.
Published: (2026)
Faithful and Robust LLM-Driven Theorem Proving for NLI Explanations
by: Quan, Xin, et al.
Published: (2025)
The Effect of Model Size on LLM Post-hoc Explainability via LIME
by: Heyen, Henning, et al.
Published: (2024)
Learning to Understand: Identifying Interactions via the Möbius Transform
by: Kang, Justin S., et al.
Published: (2024)
SKATE, a Scalable Tournament Eval: Weaker LLMs differentiate between stronger ones using verifiable challenges
by: Gould, Dewi S. W., et al.
Published: (2025)
Explanation-Driven Counterfactual Testing for Faithfulness in Vision-Language Model Explanations
by: Ding, Sihao, et al.
Published: (2025)
Faithful and Plausible Natural Language Explanations for Image Classification: A Pipeline Approach
by: Wojciechowski, Adam, et al.
Published: (2024)
Large language models can help boost food production, but be mindful of their risks
by: De Clercq, Djavan, et al.
Published: (2024)
On Measuring Faithfulness or Self-consistency of Natural Language Explanations
by: Parcalabescu, Letitia, et al.
Published: (2023)
Neuro-Argumentative Learning with Case-Based Reasoning
by: Gould, Adam, et al.
Published: (2025)
LLMs Can Covertly Sandbag on Capability Evaluations Against Chain-of-Thought Monitoring
by: Li, Chloe, et al.
Published: (2025)
Illocutionary Explanation Planning for Source-Faithful Explanations in Retrieval-Augmented Language Models
by: Sovrano, Francesco, et al.
Published: (2026)
DeepFaith: A Domain-Free and Model-Agnostic Unified Framework for Highly Faithful Explanations
by: Guo, Yuhan, et al.
Published: (2025)
AirTrafficGen: Configurable Air Traffic Scenario Generation with Large Language Models
by: Gould, Dewi Sid William, et al.
Published: (2025)
How Does DPO Reduce Toxicity? A Mechanistic Neuron-Level Analysis
by: Yang, Yushi, et al.
Published: (2024)
A framework for assuring the accuracy and fidelity of an AI-enabled Digital Twin of en route UK airspace
by: Keane, Adam, et al.
Published: (2026)
Comparables XAI: Faithful Example-based AI Explanations with Counterfactual Trace Adjustments
by: Zhang, Yifan, et al.
Published: (2026)
Faithfulness and the Notion of Adversarial Sensitivity in NLP Explanations
by: Manna, Supriya, et al.
Published: (2024)
Walk the Talk? Measuring the Faithfulness of Large Language Model Explanations
by: Matton, Katie, et al.
Published: (2025)
LINGOLY-TOO: Disentangling Reasoning from Knowledge with Templatised Orthographic Obfuscation
by: Khouja, Jude, et al.
Published: (2025)
Toward a Theory of Tokenization in LLMs
by: Rajaraman, Nived, et al.
Published: (2024)
Negation Neglect: When models fail to learn negations in training
by: Mayne, Harry, et al.
Published: (2026)
Evaluating Readability and Faithfulness of Concept-based Explanations
by: Li, Meng, et al.
Published: (2024)