Staff View: Generalization Limits of Reinforcement Learning Alignment :: Library Catalog

Bibliographic Details
Main Authors:	Shida, Haruhi; Imai, Koo; Kansa, Keigo
Format:	Preprint
Published:	2026
Subjects:	Machine Learning; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2604.02652

_version_	1866917382597902336
author	Shida, Haruhi; Imai, Koo; Kansa, Keigo
author_facet	Shida, Haruhi; Imai, Koo; Kansa, Keigo
contents	The safety of large language models (LLMs) relies on alignment techniques such as reinforcement learning from human feedback (RLHF). However, recent theoretical analyses suggest that reinforcement learning-based training does not acquire new capabilities but merely redistributes the utilization probabilities of existing ones. In this study, we propose "compound jailbreaks" targeting OpenAI gpt-oss-20b, which exploit the generalization failures of alignment. This approach combines multiple attack techniques, each individually defended against, to saturate the instruction hierarchy maintenance process. Our evaluation shows that the attack success rate (ASR) increased from 14.3% with individual methods to 71.4% with the combined approach. These results provide empirical evidence for the hypothesis that safety training does not generalize as broadly as model capabilities, highlighting the need for multifaceted safety evaluations using compound attack scenarios.
format	Preprint
id	arxiv_https___arxiv_org_abs_2604_02652
institution	arXiv
publishDate	2026
record_format	arxiv
title	Generalization Limits of Reinforcement Learning Alignment
topic	Machine Learning; Artificial Intelligence
url	https://arxiv.org/abs/2604.02652

Similar Items