Staff View: Direction-Flipped Influence Audits Reveal Hidden Structure in Moral Choices of LLMs

Bibliographic Details
Main Authors:	Blandfort, Phil; Karayil, Tushar; McKenzie, Alex; Pawar, Urja; Graham, Robert; Krasheninnikov, Dmitrii
Format:	Preprint
Published:	2026
Subjects:	Machine Learning; Artificial Intelligence; Computation and Language; Computer Vision and Pattern Recognition; Computers and Society
Online Access:	https://arxiv.org/abs/2602.22831

_version_	1866909024679624704
author	Blandfort, Phil; Karayil, Tushar; McKenzie, Alex; Pawar, Urja; Graham, Robert; Krasheninnikov, Dmitrii
author_facet	Blandfort, Phil; Karayil, Tushar; McKenzie, Alex; Pawar, Urja; Graham, Robert; Krasheninnikov, Dmitrii
contents	Moral benchmarks for LLMs typically score models on context-free prompts, implicitly treating the measured choice rate as stable. We test this assumption with a direction-flipped influence audit: for each scenario, we compare a baseline prompt with matched cues steering toward option A or option B. Across a trolley-problem-style moral triage task, BBQ, and DailyDilemmas, and across five LLM families with and without reasoning, short contextual cues shift per-condition choice rates by 12-18 percentage points on average. These shifts reveal structure that baseline scores miss: roughly 40% of baseline-neutral triage and BBQ conditions exhibit directional asymmetry under influence, and a meaningful share of significant effects backfire, moving opposite the cue's intended direction. In follow-up probes, models often recognize the cue while denying that it affected their choice. Among significant backfire trials, this stated-vs.-revealed inconsistency appears in 78% of cases. Reasoning does not eliminate contextual sensitivity but reshapes it: social-pressure cues such as user preference and emotional appeal weaken across benchmarks, while few-shot demonstrations strengthen sharply on both triage and BBQ. We recommend direction-flipped influence pairs as a standard complement to context-free moral-bias evaluation, and release the harness and data to make such audits routine.
format	Preprint
id	arxiv_https___arxiv_org_abs_2602_22831
institution	arXiv
publishDate	2026
record_format	arxiv
title	Direction-Flipped Influence Audits Reveal Hidden Structure in Moral Choices of LLMs
topic	Machine Learning; Artificial Intelligence; Computation and Language; Computer Vision and Pattern Recognition; Computers and Society
url	https://arxiv.org/abs/2602.22831

Similar Items