Bibliographic Details
Main Authors: Javed, Rafiya; Parent, Cassandra; Kay, Jackie; Yanni, David; Zaini, Abdullah; Sheikh, Anushe; Rauh, Maribeth; Gerych, Walter; Comanescu, Ramona; Gabriel, Iason; Ghassemi, Marzyeh; Weidinger, Laura
Format: Preprint
Published: 2025
Online Access: https://arxiv.org/abs/2502.19463
author Javed, Rafiya
Parent, Cassandra
Kay, Jackie
Yanni, David
Zaini, Abdullah
Sheikh, Anushe
Rauh, Maribeth
Gerych, Walter
Comanescu, Ramona
Gabriel, Iason
Ghassemi, Marzyeh
Weidinger, Laura
contents Hedging and non-affirmation are behaviors exhibited by large language models (LLMs) that limit the clear endorsement of specific statements. While these behaviors are desirable in subjective contexts, they are undesirable in the context of human rights, which apply unambiguously to all groups. We present a systematic framework to measure these behaviors in unconstrained LLM responses regarding various identity groups. We evaluate six large proprietary models as well as one open-weight LLM on 4738 prompts across 205 national and stateless ethnic identities and find that 4 of the 7 models display hedging and non-affirmation that is significantly dependent on the identity of the group. While factors like conflict signals, sovereignty (whether the identity is stateless), or economic indicators (GDP) also influence model behavior, their effect sizes are consistently weaker than the impact of identity itself. The systematic disparity is robust to different methods of rephrasing the prompts. Since group identity is the strongest predictor of these behaviors, we use open-weight models to explore whether applying steering and orthogonalization techniques to these group identities can reduce the rates of hedging and non-affirmation. We find that group steering is the most effective debiasing approach across query types and is robust to downstream forgetting.
format Preprint
id arxiv_https___arxiv_org_abs_2502_19463
institution arXiv
publishDate 2025
record_format arxiv
title Hedging and Non-Affirmation: Quantifying LLM Alignment on Questions of Human Rights
topic Computers and Society
Artificial Intelligence
Social and Information Networks
url https://arxiv.org/abs/2502.19463