Bibliographic Details
Main Authors: Blandfort, Phil; Karayil, Tushar; McKenzie, Alex; Pawar, Urja; Graham, Robert; Krasheninnikov, Dmitrii
Format: Preprint
Published: 2026
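Subjects: Machine Learning; Artificial Intelligence; Computation and Language; Computer Vision and Pattern Recognition; Computers and Society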
Online Access: https://arxiv.org/abs/2602.22831
author Blandfort, Phil
Karayil, Tushar
McKenzie, Alex
Pawar, Urja
Graham, Robert
Krasheninnikov, Dmitrii
contents Moral benchmarks for LLMs typically score models on context-free prompts, implicitly treating the measured choice rate as stable. We test this assumption with a direction-flipped influence audit: for each scenario, we compare a baseline prompt with matched cues steering toward option A or option B. Across a trolley-problem-style moral triage task, BBQ, and DailyDilemmas, and across five LLM families with and without reasoning, short contextual cues shift per-condition choice rates by 12-18 percentage points on average. These shifts reveal structure that baseline scores miss: roughly 40% of baseline-neutral triage and BBQ conditions exhibit directional asymmetry under influence, and a meaningful share of significant effects backfire, moving opposite the cue's intended direction. In follow-up probes, models often recognize the cue while denying that it affected their choice. Among significant backfire trials, this stated-vs.-revealed inconsistency appears in 78% of cases. Reasoning does not eliminate contextual sensitivity but reshapes it: social-pressure cues such as user preference and emotional appeal weaken across benchmarks, while few-shot demonstrations strengthen sharply on both triage and BBQ. We recommend direction-flipped influence pairs as a standard complement to context-free moral-bias evaluation, and release the harness and data to make such audits routine.
format Preprint
id arxiv_https___arxiv_org_abs_2602_22831
institution arXiv
publishDate 2026
record_format arxiv
title Direction-Flipped Influence Audits Reveal Hidden Structure in Moral Choices of LLMs
topic Machine Learning
Artificial Intelligence
Computation and Language
Computer Vision and Pattern Recognition
Computers and Society
url https://arxiv.org/abs/2602.22831
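The direction-flipped comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' released harness: the names `ConditionRates` and `audit`, the tolerance `tol`, and the specific definitions of "backfire" and "asymmetry" used here are assumptions made for illustration, and the significance testing mentioned in the abstract is omitted.

```python
from dataclasses import dataclass


@dataclass
class ConditionRates:
    """Fraction of trials choosing option A under each prompt variant (hypothetical layout)."""
    baseline: float
    cue_toward_a: float
    cue_toward_b: float


def audit(rates: ConditionRates, tol: float = 0.05) -> dict:
    """Compare a baseline prompt with matched cues steering toward A and toward B.

    Shifts are signed so that a positive value means the choice rate moved in
    the cue's intended direction. `tol` is an assumed asymmetry threshold.
    """
    # Percentage-point movement toward A when the cue pushes toward A.
    shift_a = (rates.cue_toward_a - rates.baseline) * 100
    # Percentage-point movement toward B when the cue pushes toward B.
    shift_b = (rates.baseline - rates.cue_toward_b) * 100
    return {
        "shift_toward_a_pp": shift_a,
        "shift_toward_b_pp": shift_b,
        # Assumed definition: the rate moved opposite to the cue's direction.
        "backfire_a": shift_a < 0,
        "backfire_b": shift_b < 0,
        # Assumed definition: the two cues move the rate by unequal amounts.
        "asymmetric": abs(shift_a - shift_b) > tol * 100,
    }


if __name__ == "__main__":
    # Illustrative numbers only, not results from the paper.
    print(audit(ConditionRates(baseline=0.50, cue_toward_a=0.65, cue_toward_b=0.45)))
```

Signing both shifts relative to the cue's intended direction keeps the two arms directly comparable, so a baseline-neutral condition (rate near 0.5) can still be flagged as asymmetric when one cue moves it much further than the other.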