Staff View: :: Library Catalog

Saved in:

Bibliographic Details
Main Authors:	Wichers, Nevan, Ebtekar, Aram, Azarbal, Ariana, Gillioz, Victor, Ye, Christine, Ryd, Emil, Rathi, Neil, Sleight, Henry, Mallen, Alex, Roger, Fabien, Marks, Samuel
Format:	Preprint
Published:	2025
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2510.05024
Tags:	Add Tag No Tags, Be the first to tag this record!

_version_	1866917047408001024
author	Wichers, Nevan Ebtekar, Aram Azarbal, Ariana Gillioz, Victor Ye, Christine Ryd, Emil Rathi, Neil Sleight, Henry Mallen, Alex Roger, Fabien Marks, Samuel
author_facet	Wichers, Nevan Ebtekar, Aram Azarbal, Ariana Gillioz, Victor Ye, Christine Ryd, Emil Rathi, Neil Sleight, Henry Mallen, Alex Roger, Fabien Marks, Samuel
contents	Large language models are sometimes trained with imperfect oversight signals, leading to undesired behaviors such as reward hacking and sycophancy. Improving oversight quality can be expensive or infeasible, motivating methods that improve learned behavior despite an imperfect training signal. We introduce Inoculation Prompting (IP), a simple but counterintuitive technique that prevents learning of an undesired behavior by modifying training prompts to explicitly request it. For example, to inoculate against reward hacking, we modify the prompts used in supervised fine-tuning to request code that only works on provided test cases but fails on other inputs. Across four settings we find that IP reduces the learning of undesired behavior without substantially reducing the learning of desired capabilities. We also show that prompts which more strongly elicit the undesired behavior prior to fine-tuning more effectively inoculate against the behavior when used during training; this serves as a heuristic to identify promising inoculation prompts. Overall, IP is a simple yet effective way to control how models generalize from fine-tuning, preventing learning of undesired behaviors without substantially disrupting desired capabilities.
format	Preprint
id	arxiv_https___arxiv_org_abs_2510_05024
institution	arXiv
publishDate	2025
record_format	arxiv
spellingShingle	Inoculation Prompting: Instructing LLMs to misbehave at train-time improves test-time alignment Wichers, Nevan Ebtekar, Aram Azarbal, Ariana Gillioz, Victor Ye, Christine Ryd, Emil Rathi, Neil Sleight, Henry Mallen, Alex Roger, Fabien Marks, Samuel Machine Learning Large language models are sometimes trained with imperfect oversight signals, leading to undesired behaviors such as reward hacking and sycophancy. Improving oversight quality can be expensive or infeasible, motivating methods that improve learned behavior despite an imperfect training signal. We introduce Inoculation Prompting (IP), a simple but counterintuitive technique that prevents learning of an undesired behavior by modifying training prompts to explicitly request it. For example, to inoculate against reward hacking, we modify the prompts used in supervised fine-tuning to request code that only works on provided test cases but fails on other inputs. Across four settings we find that IP reduces the learning of undesired behavior without substantially reducing the learning of desired capabilities. We also show that prompts which more strongly elicit the undesired behavior prior to fine-tuning more effectively inoculate against the behavior when used during training; this serves as a heuristic to identify promising inoculation prompts. Overall, IP is a simple yet effective way to control how models generalize from fine-tuning, preventing learning of undesired behaviors without substantially disrupting desired capabilities.
title	Inoculation Prompting: Instructing LLMs to misbehave at train-time improves test-time alignment
topic	Machine Learning
url	https://arxiv.org/abs/2510.05024

Similar Items