Bibliographic Details
Main Authors: Lin, Yuping; He, Pengfei; Xu, Han; Xing, Yue; Yamada, Makoto; Liu, Hui; Tang, Jiliang
Format: Preprint
Published: 2024
Subjects: Computation and Language
Online Access: https://arxiv.org/abs/2406.10794
_version_ 1866913594985152512
author Lin, Yuping
He, Pengfei
Xu, Han
Xing, Yue
Yamada, Makoto
Liu, Hui
Tang, Jiliang
contents Large language models (LLMs) are susceptible to a type of attack known as jailbreaking, which misleads LLMs into outputting harmful content. Although jailbreak attack strategies are diverse, there is no unified understanding of why some methods succeed and others fail. This paper explores the behavior of harmful and harmless prompts in the LLM's representation space to investigate the intrinsic properties of successful jailbreak attacks. We hypothesize that successful attacks share a common property: they effectively move the representation of the harmful prompt in the direction of the harmless prompts. We incorporate hidden representations into the objective of existing jailbreak attacks to move the attacks along the acceptance direction, and conduct experiments to validate the above hypothesis using the proposed objective. We hope this study provides new insights into how LLMs understand harmfulness information.
format Preprint
id arxiv_https___arxiv_org_abs_2406_10794
institution arXiv
publishDate 2024
record_format arxiv
title Towards Understanding Jailbreak Attacks in LLMs: A Representation Space Analysis
topic Computation and Language
url https://arxiv.org/abs/2406.10794