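| Main Authors: | He, Zeyuan; Chen, Yupeng; Lin, Lang; Wang, Yihan; Chang, Shenxu; Sommerlade, Eric; Torr, Philip; Yu, Junchi; Bibi, Adel; Yu, Jialin |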
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | Machine Learning |
| Online Access: | https://arxiv.org/abs/2602.00388 |
| _version_ | 1866917379777232896 |
|---|---|
| author | He, Zeyuan; Chen, Yupeng; Lin, Lang; Wang, Yihan; Chang, Shenxu; Sommerlade, Eric; Torr, Philip; Yu, Junchi; Bibi, Adel; Yu, Jialin |
| contents | Diffusion large language models (D-LLMs) offer an alternative to autoregressive LLMs (AR-LLMs) and have demonstrated advantages in generation efficiency. Beyond the utility benefits, we argue that D-LLMs exhibit a previously underexplored safety blessing: their diffusion-style generation confers intrinsic robustness against jailbreak attacks originally designed for AR-LLMs. In this work, we provide an initial analysis of the underlying mechanism, showing that the diffusion trajectory induces a stepwise reduction effect that progressively suppresses unsafe generations. This robustness, however, is not absolute. Following this analysis, we highlight a simple yet effective failure mode, context nesting, in which harmful requests are embedded within structured benign contexts. Empirically, we show that this simple black-box strategy bypasses D-LLMs' safety blessing, achieving state-of-the-art attack success rates across models and benchmarks. Notably, it enables the first successful jailbreak of Gemini Diffusion to our knowledge, exposing a critical vulnerability in proprietary D-LLMs. Together, our results characterize both the origins and the limits of D-LLMs' safety blessing, constituting an early-stage red-teaming of D-LLMs. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2602_00388 |
| institution | arXiv |
| publishDate | 2026 |
| record_format | arxiv |
| title | Safer by Diffusion, Broken by Context: Diffusion LLM's Safety Blessing and Its Failure Mode |
| topic | Machine Learning |
| url | https://arxiv.org/abs/2602.00388 |