Bibliographic Details
Main Authors: He, Zeyuan, Chen, Yupeng, Lin, Lang, Wang, Yihan, Chang, Shenxu, Sommerlade, Eric, Torr, Philip, Yu, Junchi, Bibi, Adel, Yu, Jialin
Format: Preprint
Published: 2026
Subjects: Machine Learning
Online Access: https://arxiv.org/abs/2602.00388
_version_ 1866917379777232896
author He, Zeyuan
Chen, Yupeng
Lin, Lang
Wang, Yihan
Chang, Shenxu
Sommerlade, Eric
Torr, Philip
Yu, Junchi
Bibi, Adel
Yu, Jialin
contents Diffusion large language models (D-LLMs) offer an alternative to autoregressive LLMs (AR-LLMs) and have demonstrated advantages in generation efficiency. Beyond these utility benefits, we argue that D-LLMs exhibit a previously underexplored safety blessing: their diffusion-style generation confers intrinsic robustness against jailbreak attacks originally designed for AR-LLMs. In this work, we provide an initial analysis of the underlying mechanism, showing that the diffusion trajectory induces a stepwise reduction effect that progressively suppresses unsafe generations. This robustness, however, is not absolute. Following this analysis, we highlight a simple yet effective failure mode, context nesting, in which harmful requests are embedded within structured benign contexts. Empirically, we show that this simple black-box strategy bypasses D-LLMs' safety blessing, achieving state-of-the-art attack success rates across models and benchmarks. Notably, it enables what is, to our knowledge, the first successful jailbreak of Gemini Diffusion, exposing a critical vulnerability in proprietary D-LLMs. Together, our results characterize both the origins and the limits of D-LLMs' safety blessing, constituting an early-stage red-teaming of D-LLMs.
format Preprint
id arxiv_https___arxiv_org_abs_2602_00388
institution arXiv
publishDate 2026
record_format arxiv
title Safer by Diffusion, Broken by Context: Diffusion LLM's Safety Blessing and Its Failure Mode
topic Machine Learning
url https://arxiv.org/abs/2602.00388