Bibliographic Details
Main Authors: Wen, Jiaxin; Zhong, Ruiqi; Khan, Akbir; Perez, Ethan; Steinhardt, Jacob; Huang, Minlie; Bowman, Samuel R.; He, He; Feng, Shi
Format: Preprint
Published: 2024
Subjects: Computation and Language
Online Access: https://arxiv.org/abs/2409.12822
_version_ 1866929619307855872
author Wen, Jiaxin
Zhong, Ruiqi
Khan, Akbir
Perez, Ethan
Steinhardt, Jacob
Huang, Minlie
Bowman, Samuel R.
He, He
Feng, Shi
contents Language models (LMs) can produce errors that are hard to detect for humans, especially when the task is complex. RLHF, the most popular post-training method, may exacerbate this problem: to achieve higher rewards, LMs might get better at convincing humans that they are right even when they are wrong. We study this phenomenon under a standard RLHF pipeline, calling it "U-SOPHISTRY" since it is Unintended by model developers. Specifically, we ask time-constrained (e.g., 3-10 minutes) human subjects to evaluate the correctness of model outputs and calculate humans' accuracy against gold labels. On a question-answering task (QuALITY) and programming task (APPS), RLHF makes LMs better at convincing our subjects but not at completing the task correctly. RLHF also makes the model harder to evaluate: our subjects' false positive rate increases by 24.1% on QuALITY and 18.3% on APPS. Finally, we show that probing, a state-of-the-art approach for detecting Intended Sophistry (e.g. backdoored LMs), does not generalize to U-SOPHISTRY. Our results highlight an important failure mode of RLHF and call for more research in assisting humans to align them.
format Preprint
id arxiv_https___arxiv_org_abs_2409_12822
institution arXiv
publishDate 2024
record_format arxiv
title Language Models Learn to Mislead Humans via RLHF
topic Computation and Language
url https://arxiv.org/abs/2409.12822