Staff View: Re-Mask and Redirect: Exploiting Denoising Irreversibility in Diffusion Language Models :: Library Catalog

Bibliographic Details
Main Author:	Singh, Arth
Format:	Preprint
Published:	2026
Subjects:	Computation and Language; Artificial Intelligence; I.2.6
Online Access:	https://arxiv.org/abs/2604.08557

_version_	1866914465995292672
author	Singh, Arth
author_facet	Singh, Arth
contents	Safety alignment in diffusion language models (dLLMs) relies on a single load-bearing assumption: that committed tokens are permanent. We show that violating this assumption, by re-masking committed refusal tokens and injecting a short affirmative prefix, achieves 74-82% ASR on HarmBench across all three publicly available safety-tuned dLLMs, rising to 92-98% with a generic 8-token compliance prefix. We call this attack TrajHijack; it is the first trajectory-level attack on dLLMs, requires no gradient computation, and generalizes across SFT and preference-optimized (VRPO) models. Three findings emerge. First, the vulnerability is irreducibly two-component: re-masking alone (4.4%) and prefix alone (5.7%) both fail. Second, gradient optimization via a differentiable Gumbel-softmax chain consistently degrades ASR (41.5% vs. 76.1%), because continuous perturbations push token distributions off-manifold. Third, A2D (the strongest published dLLM defense) is more vulnerable to TrajHijack (89.9%) than the undefended model (76.1%): its silent-refusal training removes the contextual resistance that trajectory-level attacks must overcome, an effect we call the Defense Inversion Effect.
format	Preprint
id	arxiv_https___arxiv_org_abs_2604_08557
institution	arXiv
publishDate	2026
record_format	arxiv
title	Re-Mask and Redirect: Exploiting Denoising Irreversibility in Diffusion Language Models
topic	Computation and Language; Artificial Intelligence; I.2.6
url	https://arxiv.org/abs/2604.08557