Staff View: Safer by Diffusion, Broken by Context: Diffusion LLM's Safety Blessing and Its Failure Mode :: Library Catalog

Bibliographic Details
Main Authors:	He, Zeyuan; Chen, Yupeng; Lin, Lang; Wang, Yihan; Chang, Shenxu; Sommerlade, Eric; Torr, Philip; Yu, Junchi; Bibi, Adel; Yu, Jialin
Format:	Preprint
Published:	2026
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2602.00388

_version_	1866917379777232896
author	He, Zeyuan Chen, Yupeng Lin, Lang Wang, Yihan Chang, Shenxu Sommerlade, Eric Torr, Philip Yu, Junchi Bibi, Adel Yu, Jialin
author_facet	He, Zeyuan Chen, Yupeng Lin, Lang Wang, Yihan Chang, Shenxu Sommerlade, Eric Torr, Philip Yu, Junchi Bibi, Adel Yu, Jialin
contents	Diffusion large language models (D-LLMs) offer an alternative to autoregressive LLMs (AR-LLMs) and have demonstrated advantages in generation efficiency. Beyond the utility benefits, we argue that D-LLMs exhibit a previously underexplored safety blessing: their diffusion-style generation confers intrinsic robustness against jailbreak attacks originally designed for AR-LLMs. In this work, we provide an initial analysis of the underlying mechanism, showing that the diffusion trajectory induces a stepwise reduction effect that progressively suppresses unsafe generations. This robustness, however, is not absolute. Following this analysis, we highlight a simple yet effective failure mode, context nesting, in which harmful requests are embedded within structured benign contexts. Empirically, we show that this simple black-box strategy bypasses D-LLMs' safety blessing, achieving state-of-the-art attack success rates across models and benchmarks. Notably, it enables the first successful jailbreak of Gemini Diffusion to our knowledge, exposing a critical vulnerability in proprietary D-LLMs. Together, our results characterize both the origins and the limits of D-LLMs' safety blessing, constituting an early-stage red-teaming of D-LLMs.
format	Preprint
id	arxiv_https___arxiv_org_abs_2602_00388
institution	arXiv
publishDate	2026
record_format	arxiv
title	Safer by Diffusion, Broken by Context: Diffusion LLM's Safety Blessing and Its Failure Mode
topic	Machine Learning
url	https://arxiv.org/abs/2602.00388

Similar Items