Staff View: Aligning to What? Limits to RLHF Based Alignment :: Library Catalog

Bibliographic Details
Main Authors:	Barnhart, Logan; Bafghi, Reza Akbarian; Becker, Stephen; Raissi, Maziar
Format:	Preprint
Published:	2025
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2503.09025

_version_	1866910871618322432
author	Barnhart, Logan; Bafghi, Reza Akbarian; Becker, Stephen; Raissi, Maziar
author_facet	Barnhart, Logan; Bafghi, Reza Akbarian; Becker, Stephen; Raissi, Maziar
contents	Reinforcement Learning from Human Feedback (RLHF) is increasingly used to align large language models (LLMs) with human preferences. However, the effectiveness of RLHF in addressing underlying biases remains unclear. This study investigates the relationship between RLHF and both covert and overt biases in LLMs, particularly focusing on biases against African Americans. We applied various RLHF techniques (DPO, ORPO, and RLOO) to Llama 3 8B and evaluated the covert and overt biases of the resulting models using matched-guise probing and explicit bias testing. We performed additional tests with DPO on different base models and datasets; among several implications, we found that SFT before RLHF calcifies model biases. Additionally, we extend the tools for measuring biases to multi-modal models. Through our experiments we collect evidence that indicates that current alignment techniques are inadequate for nebulous tasks such as mitigating covert biases, highlighting the need for capable datasets, data curating techniques, or alignment tools.
format	Preprint
id	arxiv_https___arxiv_org_abs_2503_09025
institution	arXiv
publishDate	2025
record_format	arxiv
title	Aligning to What? Limits to RLHF Based Alignment
topic	Computation and Language
url	https://arxiv.org/abs/2503.09025

Similar Items