Staff View: Towards Understanding Jailbreak Attacks in LLMs: A Representation Space Analysis :: Library Catalog

Bibliographic Details
Main Authors:	Lin, Yuping; He, Pengfei; Xu, Han; Xing, Yue; Yamada, Makoto; Liu, Hui; Tang, Jiliang
Format:	Preprint
Published:	2024
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2406.10794

_version_	1866913594985152512
author	Lin, Yuping; He, Pengfei; Xu, Han; Xing, Yue; Yamada, Makoto; Liu, Hui; Tang, Jiliang
author_facet	Lin, Yuping; He, Pengfei; Xu, Han; Xing, Yue; Yamada, Makoto; Liu, Hui; Tang, Jiliang
contents	Large language models (LLMs) are susceptible to jailbreaking, a type of attack that misleads LLMs into outputting harmful content. Although jailbreak attack strategies are diverse, there is no unified understanding of why some methods succeed and others fail. This paper explores the behavior of harmful and harmless prompts in the LLM's representation space to investigate the intrinsic properties of successful jailbreak attacks. We hypothesize that successful attacks share a common property: they effectively move the representation of the harmful prompt in the direction of the harmless prompts. We incorporate hidden representations into the objective of existing jailbreak attacks to move the attacks along the acceptance direction, and conduct experiments to validate the above hypothesis using the proposed objective. We hope this study provides new insights into how LLMs understand harmfulness information.
format	Preprint
id	arxiv_https___arxiv_org_abs_2406_10794
institution	arXiv
publishDate	2024
record_format	arxiv
title	Towards Understanding Jailbreak Attacks in LLMs: A Representation Space Analysis
topic	Computation and Language
url	https://arxiv.org/abs/2406.10794
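
Note on the abstract: the "acceptance direction" it mentions can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes, as one plausible reading of the abstract, that the direction is the unit vector from the centroid of harmful-prompt representations to the centroid of harmless-prompt representations. The function names and the toy 2-D vectors are hypothetical stand-ins for real LLM hidden states.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def acceptance_direction(harmful_reps, harmless_reps):
    """Unit vector from the harmful centroid to the harmless centroid.

    Assumption: this centroid-difference definition is illustrative only.
    """
    start = centroid(harmful_reps)
    end = centroid(harmless_reps)
    diff = [e - s for s, e in zip(start, end)]
    norm = math.sqrt(sum(x * x for x in diff))
    return [x / norm for x in diff]


def movement_along(direction, original_rep, attacked_rep):
    """Signed distance a prompt's representation moved along `direction`."""
    return sum((a - o) * d for o, a, d in zip(original_rep, attacked_rep, direction))


# Toy 2-D "representations" (hypothetical data, not taken from the paper).
harmful = [[-1.0, -1.0], [-1.2, -0.8], [-0.8, -1.2]]
harmless = [[1.0, 1.0], [1.2, 0.8], [0.8, 1.2]]
direction = acceptance_direction(harmful, harmless)

# Simulate an attack that shifts a harmful prompt toward the harmless side.
original = harmful[0]
attacked = [o + 0.5 * d for o, d in zip(original, direction)]
score = movement_along(direction, original, attacked)
```

Under this reading, a positive score means the attacked prompt moved toward the harmless prompts, which is the property the abstract hypothesizes successful attacks share.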

Similar Items