Bibliographic Details
Main Authors: Lindström, Adam Dahlgren; Methnani, Leila; Krause, Lea; Ericson, Petter; de Troya, Íñigo Martínez de Rituerto; Mollo, Dimitri Coelho; Dobbe, Roel
Format: Preprint
Published: 2024
Subjects: Artificial Intelligence
Online Access: https://arxiv.org/abs/2406.18346
author Lindström, Adam Dahlgren
Methnani, Leila
Krause, Lea
Ericson, Petter
de Troya, Íñigo Martínez de Rituerto
Mollo, Dimitri Coelho
Dobbe, Roel
contents This paper critically evaluates the attempts to align Artificial Intelligence (AI) systems, especially Large Language Models (LLMs), with human values and intentions through Reinforcement Learning from Feedback (RLxF) methods, involving either human feedback (RLHF) or AI feedback (RLAIF). Specifically, we show the shortcomings of the broadly pursued alignment goals of honesty, harmlessness, and helpfulness. Through a multidisciplinary sociotechnical critique, we examine both the theoretical underpinnings and practical implementations of RLxF techniques, revealing significant limitations in their approach to capturing the complexities of human ethics and contributing to AI safety. We highlight tensions and contradictions inherent in the goals of RLxF. In addition, we discuss ethically-relevant issues that tend to be neglected in discussions about alignment and RLxF, among which the trade-offs between user-friendliness and deception, flexibility and interpretability, and system safety. We conclude by urging researchers and practitioners alike to critically assess the sociotechnical ramifications of RLxF, advocating for a more nuanced and reflective approach to its application in AI development.
format Preprint
id arxiv_https___arxiv_org_abs_2406_18346
institution arXiv
publishDate 2024
record_format arxiv
title AI Alignment through Reinforcement Learning from Human Feedback? Contradictions and Limitations
topic Artificial Intelligence
url https://arxiv.org/abs/2406.18346