Staff View: :: Library Catalog

Saved in:

Bibliographic Details
Main Author:	Cao, Wenjun
Format:	Preprint
Published:	2025
Subjects:	Machine Learning Artificial Intelligence
Online Access:	https://arxiv.org/abs/2505.04578
Tags:	Add Tag No Tags, Be the first to tag this record!

_version_	1866915276309659648
author	Cao, Wenjun
author_facet	Cao, Wenjun
contents	Reinforcement learning (RL) fine-tuning transforms large language models while creating a vulnerability we experimentally verify: Our experiment shows that malicious RL fine-tuning dismantles safety guardrails with remarkable efficiency, requiring only 50 steps and minimal adversarial prompts, with harmful escalating from 0-2 to 7-9. This attack vector particularly threatens open-source models with parameter-level access. Existing defenses targeting supervised fine-tuning prove ineffective against RL's dynamic feedback mechanisms. We introduce Reward Neutralization, the first defense framework specifically designed against RL fine-tuning attacks, establishing concise rejection patterns that render malicious reward signals ineffective. Our approach trains models to produce minimal-information rejections that attackers cannot exploit, systematically neutralizing attempts to optimize toward harmful outputs. Experiments validate that our approach maintains low harmful scores (no greater than 2) after 200 attack steps, while standard models rapidly deteriorate. This work provides the first constructive proof that robust defense against increasingly accessible RL attacks is achievable, addressing a critical security gap for open-weight models.
format	Preprint
id	arxiv_https___arxiv_org_abs_2505_04578
institution	arXiv
publishDate	2025
record_format	arxiv
spellingShingle	Fight Fire with Fire: Defending Against Malicious RL Fine-Tuning via Reward Neutralization Cao, Wenjun Machine Learning Artificial Intelligence Reinforcement learning (RL) fine-tuning transforms large language models while creating a vulnerability we experimentally verify: Our experiment shows that malicious RL fine-tuning dismantles safety guardrails with remarkable efficiency, requiring only 50 steps and minimal adversarial prompts, with harmful escalating from 0-2 to 7-9. This attack vector particularly threatens open-source models with parameter-level access. Existing defenses targeting supervised fine-tuning prove ineffective against RL's dynamic feedback mechanisms. We introduce Reward Neutralization, the first defense framework specifically designed against RL fine-tuning attacks, establishing concise rejection patterns that render malicious reward signals ineffective. Our approach trains models to produce minimal-information rejections that attackers cannot exploit, systematically neutralizing attempts to optimize toward harmful outputs. Experiments validate that our approach maintains low harmful scores (no greater than 2) after 200 attack steps, while standard models rapidly deteriorate. This work provides the first constructive proof that robust defense against increasingly accessible RL attacks is achievable, addressing a critical security gap for open-weight models.
title	Fight Fire with Fire: Defending Against Malicious RL Fine-Tuning via Reward Neutralization
topic	Machine Learning Artificial Intelligence
url	https://arxiv.org/abs/2505.04578

Similar Items