| Main Authors: | Blandfort, Phil; Karayil, Tushar; McKenzie, Alex; Pawar, Urja; Graham, Robert; Krasheninnikov, Dmitrii |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
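| Subjects: | Machine Learning; Artificial Intelligence; Computation and Language; Computer Vision and Pattern Recognition; Computers and Society |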
| Online Access: | https://arxiv.org/abs/2602.22831 |
| _version_ | 1866909024679624704 |
|---|---|
| author | Blandfort, Phil; Karayil, Tushar; McKenzie, Alex; Pawar, Urja; Graham, Robert; Krasheninnikov, Dmitrii |
| author_facet | Blandfort, Phil; Karayil, Tushar; McKenzie, Alex; Pawar, Urja; Graham, Robert; Krasheninnikov, Dmitrii |
| contents | Moral benchmarks for LLMs typically score models on context-free prompts, implicitly treating the measured choice rate as stable. We test this assumption with a direction-flipped influence audit: for each scenario, we compare a baseline prompt with matched cues steering toward option A or option B. Across a trolley-problem-style moral triage task, BBQ, and DailyDilemmas, and across five LLM families with and without reasoning, short contextual cues shift per-condition choice rates by 12-18 percentage points on average. These shifts reveal structure that baseline scores miss: roughly 40% of baseline-neutral triage and BBQ conditions exhibit directional asymmetry under influence, and a meaningful share of significant effects backfire, moving opposite the cue's intended direction. In follow-up probes, models often recognize the cue while denying that it affected their choice. Among significant backfire trials, this stated-vs.-revealed inconsistency appears in 78% of cases. Reasoning does not eliminate contextual sensitivity but reshapes it: social-pressure cues such as user preference and emotional appeal weaken across benchmarks, while few-shot demonstrations strengthen sharply on both triage and BBQ. We recommend direction-flipped influence pairs as a standard complement to context-free moral-bias evaluation, and release the harness and data to make such audits routine. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2602_22831 |
| institution | arXiv |
| publishDate | 2026 |
| record_format | arxiv |
| title | Direction-Flipped Influence Audits Reveal Hidden Structure in Moral Choices of LLMs |
| topic | Machine Learning; Artificial Intelligence; Computation and Language; Computer Vision and Pattern Recognition; Computers and Society |
| url | https://arxiv.org/abs/2602.22831 |