Staff View: Understanding and Defending VLM Jailbreaks via Jailbreak-Related Representation Shift :: Library Catalog

Bibliographic Details
Main Authors:	Wei, Zhihua; Li, Qiang; Ruan, Jian; Qin, Zhenxin; Wen, Leilei; Liu, Dongrui; Shen, Wen
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2603.17372

_version_	1866908896854016000
author	Wei, Zhihua; Li, Qiang; Ruan, Jian; Qin, Zhenxin; Wen, Leilei; Liu, Dongrui; Shen, Wen
author_facet	Wei, Zhihua; Li, Qiang; Ruan, Jian; Qin, Zhenxin; Wen, Leilei; Liu, Dongrui; Shen, Wen
contents	Large vision-language models (VLMs) often exhibit weakened safety alignment with the integration of the visual modality. Even when text prompts contain explicit harmful intent, adding an image can substantially increase jailbreak success rates. In this paper, we observe that VLMs can clearly distinguish benign inputs from harmful ones in their representation space. Moreover, even among harmful inputs, jailbreak samples form a distinct internal state that is separable from refusal samples. These observations suggest that jailbreaks do not arise from a failure to recognize harmful intent. Instead, the visual modality shifts representations toward a specific jailbreak state, thereby leading to a failure to trigger refusal. To quantify this transition, we identify a jailbreak direction and define the jailbreak-related shift as the component of the image-induced representation shift along this direction. Our analysis shows that the jailbreak-related shift reliably characterizes jailbreak behavior, providing a unified explanation for diverse jailbreak scenarios. Finally, we propose a defense method that enhances VLM safety by removing the jailbreak-related shift (JRS-Rem) at inference time. Experiments show that JRS-Rem provides strong defense across multiple scenarios while preserving performance on benign tasks.
format	Preprint
id	arxiv_https___arxiv_org_abs_2603_17372
institution	arXiv
publishDate	2026
record_format	arxiv
title	Understanding and Defending VLM Jailbreaks via Jailbreak-Related Representation Shift
topic	Computer Vision and Pattern Recognition; Artificial Intelligence
url	https://arxiv.org/abs/2603.17372
