Bibliographic Details
Main Authors: Barnhart, Logan; Bafghi, Reza Akbarian; Becker, Stephen; Raissi, Maziar
Format: Preprint
Published: 2025
Subjects: Computation and Language
Online Access: https://arxiv.org/abs/2503.09025
_version_ 1866910871618322432
author Barnhart, Logan
Bafghi, Reza Akbarian
Becker, Stephen
Raissi, Maziar
contents Reinforcement Learning from Human Feedback (RLHF) is increasingly used to align large language models (LLMs) with human preferences. However, the effectiveness of RLHF in addressing underlying biases remains unclear. This study investigates the relationship between RLHF and both covert and overt biases in LLMs, focusing in particular on biases against African Americans. We applied several RLHF techniques (DPO, ORPO, and RLOO) to Llama 3 8B and evaluated the covert and overt biases of the resulting models using matched-guise probing and explicit bias testing. We performed additional tests with DPO on different base models and datasets; among several implications, we found that SFT before RLHF calcifies model biases. Additionally, we extend the tools for measuring biases to multi-modal models. Our experiments provide evidence that current alignment techniques are inadequate for nebulous tasks such as mitigating covert biases, highlighting the need for capable datasets, data curation techniques, or alignment tools.
format Preprint
id arxiv_https___arxiv_org_abs_2503_09025
institution arXiv
publishDate 2025
record_format arxiv
title Aligning to What? Limits to RLHF Based Alignment
topic Computation and Language
url https://arxiv.org/abs/2503.09025