Staff View: Accuracy Is Not Honesty in AI Models: Human Agency and the Dynamics of Interaction :: Library Catalog

Bibliographic Details
Main Author:	Matta, David (Daoud)
Format:	Digital resource
Language:
Published:	Zenodo 2026
Online Access:	https://doi.org/10.5281/zenodo.19438477

_version_	1866902050856501248
author	Matta, David (Daoud)
author_facet	Matta, David (Daoud)
contents	<p>When we ask an AI system a question and it gives us a wrong answer, we naturally assume it did not know the correct one. But what if it did know and said something false anyway? Recent empirical work on large language models has demonstrated precisely this: models may produce false statements even when they possess the correct answer, and larger models do not become more honest despite becoming more accurate. This paper develops a theoretical framework that accounts for these findings by arguing that accuracy and honesty are distinct, independently variable properties of AI systems, and that their conflation has led to a systematic misattribution of causality. We introduce the concept of conditional dishonesty, deviation from known truth under contextual pressure, and propose honesty-as-pressure-invariance as a measurable, falsifiable criterion distinct from correctness-based evaluation. A formal operationalization is provided alongside a four-facet taxonomy (accurate-and-honest; accurate-but-dishonest; inaccurate-but-honest; inaccurate-and-dishonest), a typology of four interactional pressure sources, and a comparison table distinguishing interactional from prospective agentic dishonesty. Differentiating conditional dishonesty from sycophancy and addressing the competing RLHF-internalization hypothesis, we argue that the primary locus of AI dishonesty in current systems is the interaction layer rather than the model's epistemic state.</p> <p>We connect the prospective agentic dishonesty analysis to the deceptive alignment literature and provide H̄(M) as an operational bridge between theoretical risk and empirical detection. We acknowledge key limitations, including reliance on a single benchmark and the philosophical complexity of attributing 'knowledge' to LLMs. Implications are drawn for alignment research, evaluation design, and the development of epistemically robust AI systems.</p>
format	Digital resource
id	zenodo_https___doi_org_10_5281_zenodo_19438477
institution	Zenodo
language
publishDate	2026
publisher	Zenodo
record_format	zenodo
title	Accuracy Is Not Honesty in AI Models: Human Agency and the Dynamics of Interaction
url	https://doi.org/10.5281/zenodo.19438477

Similar Items