MARC View :: Library Catalog

Bibliographic Details
Main Authors:	Wen, Jiaxin; Zhong, Ruiqi; Khan, Akbir; Perez, Ethan; Steinhardt, Jacob; Huang, Minlie; Bowman, Samuel R.; He, He; Feng, Shi
Format:	Preprint
Published:	2024
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2409.12822

_version_	1866929619307855872
author	Wen, Jiaxin; Zhong, Ruiqi; Khan, Akbir; Perez, Ethan; Steinhardt, Jacob; Huang, Minlie; Bowman, Samuel R.; He, He; Feng, Shi
author_facet	Wen, Jiaxin; Zhong, Ruiqi; Khan, Akbir; Perez, Ethan; Steinhardt, Jacob; Huang, Minlie; Bowman, Samuel R.; He, He; Feng, Shi
contents	Language models (LMs) can produce errors that are hard to detect for humans, especially when the task is complex. RLHF, the most popular post-training method, may exacerbate this problem: to achieve higher rewards, LMs might get better at convincing humans that they are right even when they are wrong. We study this phenomenon under a standard RLHF pipeline, calling it "U-SOPHISTRY" since it is Unintended by model developers. Specifically, we ask time-constrained (e.g., 3-10 minutes) human subjects to evaluate the correctness of model outputs and calculate humans' accuracy against gold labels. On a question-answering task (QuALITY) and programming task (APPS), RLHF makes LMs better at convincing our subjects but not at completing the task correctly. RLHF also makes the model harder to evaluate: our subjects' false positive rate increases by 24.1% on QuALITY and 18.3% on APPS. Finally, we show that probing, a state-of-the-art approach for detecting Intended Sophistry (e.g. backdoored LMs), does not generalize to U-SOPHISTRY. Our results highlight an important failure mode of RLHF and call for more research in assisting humans to align them.
format	Preprint
id	arxiv_https___arxiv_org_abs_2409_12822
institution	arXiv
publishDate	2024
record_format	arxiv
title	Language Models Learn to Mislead Humans via RLHF
topic	Computation and Language
url	https://arxiv.org/abs/2409.12822

Similar Items