Bibliographic Details
Main Authors: Papadatos, Henry; Freedman, Rachel
Format: Preprint
Published: 2024
Subjects: Artificial Intelligence
Online Access: https://arxiv.org/abs/2412.00967
_version_ 1866915042578923520
author Papadatos, Henry
Freedman, Rachel
author_facet Papadatos, Henry
Freedman, Rachel
contents Large language models (LLMs) are often sycophantic, prioritizing agreement with their users over accurate or objective statements. This problematic behavior becomes more pronounced during reinforcement learning from human feedback (RLHF), an LLM fine-tuning stage intended to align model outputs with human values. Instead of increasing accuracy and reliability, the reward model learned from RLHF often rewards sycophancy. We develop a linear probing method to identify and penalize markers of sycophancy within the reward model, producing rewards that discourage sycophantic behavior. Our experiments show that constructing and optimizing against this surrogate reward function reduces sycophantic behavior in multiple open-source LLMs. Our results suggest a generalizable methodology for reducing unwanted LLM behaviors that are not sufficiently disincentivized by RLHF fine-tuning.
format Preprint
id arxiv_https___arxiv_org_abs_2412_00967
institution arXiv
publishDate 2024
record_format arxiv
title Linear Probe Penalties Reduce LLM Sycophancy
topic Artificial Intelligence
url https://arxiv.org/abs/2412.00967
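
Illustrative sketch (not part of the catalog record or the preprint): the abstract describes using a linear probe to detect sycophancy markers inside the reward model and subtracting a penalty to form a surrogate reward. The Python below shows one way that idea could look on toy data. Everything specific here is an assumption: the abstract does not say how the probe is trained, which activations it reads, or how the penalty is scaled. The mean-difference probe, the clamp at zero, and the names `fit_mean_difference_probe`, `probe_score`, `surrogate_reward`, and `penalty_weight` are hypothetical choices made for illustration only.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def mean_vector(rows: Sequence[Vector]) -> List[float]:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def fit_mean_difference_probe(
    sycophantic: Sequence[Vector], neutral: Sequence[Vector]
) -> Tuple[List[float], float]:
    """Fit a very simple linear probe (ASSUMPTION: mean-difference direction).

    The weights point from the mean of the neutral activations to the mean
    of the sycophantic ones; the bias puts the decision boundary at the
    midpoint between the two means.
    """
    mu_pos = mean_vector(sycophantic)
    mu_neg = mean_vector(neutral)
    weights = [p - q for p, q in zip(mu_pos, mu_neg)]
    midpoint = [(p + q) / 2 for p, q in zip(mu_pos, mu_neg)]
    bias = -sum(w * m for w, m in zip(weights, midpoint))
    return weights, bias


def probe_score(weights: Vector, bias: float, activation: Vector) -> float:
    """Linear probe output; positive values lean toward 'sycophantic'."""
    return sum(w * a for w, a in zip(weights, activation)) + bias


def surrogate_reward(
    base_reward: float,
    activation: Vector,
    weights: Vector,
    bias: float,
    penalty_weight: float = 1.0,
) -> float:
    """Base reward minus a penalty on the positive part of the probe score.

    ASSUMPTION: clamping at zero means responses the probe reads as
    non-sycophantic are left unchanged rather than given a bonus.
    """
    penalty = max(0.0, probe_score(weights, bias, activation))
    return base_reward - penalty_weight * penalty


# Toy activations (hypothetical): the first coordinate carries the signal.
sycophantic_acts = [[1.0, 0.0], [3.0, 0.0]]
neutral_acts = [[-1.0, 0.0], [-3.0, 0.0]]
weights, bias = fit_mean_difference_probe(sycophantic_acts, neutral_acts)

# A response whose activation lies on the sycophantic side is penalized;
# one on the neutral side keeps its original reward.
penalized = surrogate_reward(2.0, [1.0, 5.0], weights, bias, penalty_weight=0.5)
unchanged = surrogate_reward(2.0, [-1.0, 5.0], weights, bias, penalty_weight=0.5)
print(penalized, unchanged)
```

In a real setting the activations would presumably come from a reward model's hidden states and the probe would be trained on labeled examples; the toy vectors above only stand in for that data.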