MARC View :: Library Catalog

Bibliographic Details
Main Authors:	Zhou, Shuyi; Song, Zeen; Qiang, Wenwen; Sun, Jiyan; Zhou, Yao; Liu, Yinlong; Ma, Wei
Format:	Preprint
Published:	2026
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2603.02675

_version_	1866910039126573056
author	Zhou, Shuyi; Song, Zeen; Qiang, Wenwen; Sun, Jiyan; Zhou, Yao; Liu, Yinlong; Ma, Wei
author_facet	Zhou, Shuyi; Song, Zeen; Qiang, Wenwen; Sun, Jiyan; Zhou, Yao; Liu, Yinlong; Ma, Wei
contents	Large Language Models remain vulnerable to adversarial prefix attacks (e.g., "Sure, here is") despite robust standard safety. We diagnose this vulnerability as Shallow Safety Alignment, stemming from a pathology we term semantic representation decay: as the model generates compliant prefixes, its internal malicious intent signal fades. To address this, we propose Two-Stage Causal-GRPO (TSC-GRPO), a framework designed to achieve intent pinning. First, grounded in causal identifiability theory, we train a causal intent probe to disentangle invariant intent from stylistic perturbations. Second, we internalize this causal awareness into the policy via Group Relative Policy Optimization. By employing a cumulative causal penalty within "fork-in-the-road" training scenarios, we force the model to learn that accumulating harmful tokens monotonically decreases reward, enabling robust late-stage refusals. Experiments show that TSC-GRPO significantly outperforms baselines in defending against jailbreak attacks while preserving general utility.
format	Preprint
id	arxiv_https___arxiv_org_abs_2603_02675
institution	arXiv
publishDate	2026
record_format	arxiv
title	From Shallow to Deep: Pinning Semantic Intent via Causal GRPO
topic	Machine Learning
url	https://arxiv.org/abs/2603.02675
