Staff View: :: Library Catalog

Saved in:

Bibliographic Details
Main Authors:	Yan, Bo, Lin, Weikai, Zhu, Yada, Wang, Song
Format:	Preprint
Published:	2026
Subjects:	Cryptography and Security Artificial Intelligence
Online Access:	https://arxiv.org/abs/2604.16824
Tags:	Add Tag No Tags, Be the first to tag this record!

_version_	1866911603960578048
author	Yan, Bo Lin, Weikai Zhu, Yada Wang, Song
author_facet	Yan, Bo Lin, Weikai Zhu, Yada Wang, Song
contents	Multi-turn jailbreak attacks progressively erode LLM safety alignment across seemingly innocuous conversation turns, achieving success rates exceeding 90% against state-of-the-art models. Existing alignment-based and guardrail methods suffer from three key limitations: they require costly weight modification, evaluate each turn independently without modeling cumulative safety erosion, and detect attacks only after harmful content has been generated. To address these limitations, we first formulate the proactive early jailbreak detection problem with a new metric, detection lead, that measures how early an attack can be detected before the LLM complies. We then propose SAFEDREAM, a lightweight world-model-based framework that operates as an external module without modifying the LLM's weights. SAFEDREAM introduces three components: (1) a safety state world model that encodes LLM hidden states into a compact safety representation and predicts how it evolves across turns, (2) CUSUM detection that accumulates weak per-turn risk signals into reliable evidence, and (3) contrastive imagination that simultaneously rolls out attack and benign futures in latent space to issue early alarms before jailbreaks occur. On three multi-turn jailbreak benchmarks (XGuard-Train, SafeDialBench, SafeMTData) against 8 baselines, SAFEDREAM achieves the best detection timeliness across all benchmarks (1.06-1.20 turns before compliance) while maintaining competitive false positive rates and outperforming baselines in detection quality.
format	Preprint
id	arxiv_https___arxiv_org_abs_2604_16824
institution	arXiv
publishDate	2026
record_format	arxiv
spellingShingle	SafeDream: Safety World Model for Proactive Early Jailbreak Detection Yan, Bo Lin, Weikai Zhu, Yada Wang, Song Cryptography and Security Artificial Intelligence Multi-turn jailbreak attacks progressively erode LLM safety alignment across seemingly innocuous conversation turns, achieving success rates exceeding 90% against state-of-the-art models. Existing alignment-based and guardrail methods suffer from three key limitations: they require costly weight modification, evaluate each turn independently without modeling cumulative safety erosion, and detect attacks only after harmful content has been generated. To address these limitations, we first formulate the proactive early jailbreak detection problem with a new metric, detection lead, that measures how early an attack can be detected before the LLM complies. We then propose SAFEDREAM, a lightweight world-model-based framework that operates as an external module without modifying the LLM's weights. SAFEDREAM introduces three components: (1) a safety state world model that encodes LLM hidden states into a compact safety representation and predicts how it evolves across turns, (2) CUSUM detection that accumulates weak per-turn risk signals into reliable evidence, and (3) contrastive imagination that simultaneously rolls out attack and benign futures in latent space to issue early alarms before jailbreaks occur. On three multi-turn jailbreak benchmarks (XGuard-Train, SafeDialBench, SafeMTData) against 8 baselines, SAFEDREAM achieves the best detection timeliness across all benchmarks (1.06-1.20 turns before compliance) while maintaining competitive false positive rates and outperforming baselines in detection quality.
title	SafeDream: Safety World Model for Proactive Early Jailbreak Detection
topic	Cryptography and Security Artificial Intelligence
url	https://arxiv.org/abs/2604.16824

Similar Items