| Main Author: | Matta, David (Daoud) |
|---|---|
| Format: | Digital resource |
| Language: | |
| Published: | Zenodo, 2026 |
| Online Access: | https://doi.org/10.5281/zenodo.19438477 |
| _version_ | 1866902050856501248 |
|---|---|
| author | Matta, David (Daoud) |
| author_facet | Matta, David (Daoud) |
| contents | When we ask an AI system a question and it gives us a wrong answer, we naturally assume it did not know the correct one. But what if it did know and said something false anyway? Recent empirical work on large language models has demonstrated precisely this: models may produce false statements even when they possess the correct answer, and larger models do not become more honest despite becoming more accurate. This paper develops a theoretical framework that accounts for these findings by arguing that accuracy and honesty are distinct, independently variable properties of AI systems, and that their conflation has led to a systematic misattribution of causality. We introduce the concept of conditional dishonesty, deviation from known truth under contextual pressure, and propose honesty-as-pressure-invariance as a measurable, falsifiable criterion distinct from correctness-based evaluation. A formal operationalization is provided alongside a four-facet taxonomy (accurate-and-honest; accurate-but-dishonest; inaccurate-but-honest; inaccurate-and-dishonest), a typology of four interactional pressure sources, and a comparison table distinguishing interactional from prospective agentic dishonesty. Differentiating conditional dishonesty from sycophancy and addressing the competing RLHF-internalization hypothesis, we argue that the primary locus of AI dishonesty in current systems is the interaction layer rather than the model's epistemic state. We connect the prospective agentic dishonesty analysis to the deceptive alignment literature and provide H̄(M) as an operational bridge between theoretical risk and empirical detection. We acknowledge key limitations, including reliance on a single benchmark and the philosophical complexity of attributing 'knowledge' to LLMs. Implications are drawn for alignment research, evaluation design, and the development of epistemically robust AI systems. |
| format | Digital resource |
| id | zenodo_https___doi_org_10_5281_zenodo_19438477 |
| institution | Zenodo |
| language | |
| publishDate | 2026 |
| publisher | Zenodo |
| record_format | zenodo |
| title | Accuracy Is Not Honesty in AI Models: Human Agency and the Dynamics of Interaction |
| url | https://doi.org/10.5281/zenodo.19438477 |