Staff View: Disentangling Intent from Role: Adversarial Self-Play for Persona-Invariant Safety Alignment :: Library Catalog

Bibliographic Details
Main Authors:	Li, Jiajia; Wen, Xiaoyu; Ma, Zhongtian; Hu, Shuyue; Zhang, Qiaosheng; Wang, Zhen
Format:	Preprint
Published:	2026
Subjects:	Artificial Intelligence
Online Access:	https://arxiv.org/abs/2605.01899

_version_	1866917456617930752
author	Li, Jiajia; Wen, Xiaoyu; Ma, Zhongtian; Hu, Shuyue; Zhang, Qiaosheng; Wang, Zhen
author_facet	Li, Jiajia; Wen, Xiaoyu; Ma, Zhongtian; Hu, Shuyue; Zhang, Qiaosheng; Wang, Zhen
contents	The growing capabilities of large language models (LLMs) have driven their widespread deployment across diverse domains, including potentially high-risk scenarios. Despite advances in safety alignment techniques, current models remain vulnerable to emerging persona-based jailbreak attacks. Existing research on persona-based jailbreaks has focused primarily on iterating attacks, and it lacks systemic and mechanistic constraints on the defense side. To address this challenge, we propose Persona-Invariant Alignment (PIA), an adversarial self-play framework that achieves co-evolution through Persona Lineage Evolution (PLE) on the attack side and Persona-Invariant Consistency Learning (PICL) on the defense side. Theoretically, PICL is grounded in the structural separation hypothesis: it uses a unilateral KL-divergence constraint to structurally decouple safety decisions from persona context, thereby maintaining safe behavior under persona-based jailbreak attacks. Experimental results demonstrate that PLE efficiently explores high-risk persona spaces by leveraging lineage-based credit propagation. Meanwhile, the PICL defense method significantly reduces the Attack Success Rate (ASR) while preserving the model's general capability, thereby validating the superiority and robustness of this alignment paradigm. Code is available at https://github.com/JiajiaLi-1130/PIA.
format	Preprint
id	arxiv_https___arxiv_org_abs_2605_01899
institution	arXiv
publishDate	2026
record_format	arxiv
title	Disentangling Intent from Role: Adversarial Self-Play for Persona-Invariant Safety Alignment
topic	Artificial Intelligence
url	https://arxiv.org/abs/2605.01899

Similar Items