Staff View: Endless Jailbreaks with Bijection Learning :: Library Catalog

Bibliographic Details
Main Authors:	Huang, Brian R. Y.; Li, Maximilian; Tang, Leonard
Format:	Preprint
Published:	2024
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2410.01294

_version_	1866913830595985408
author	Huang, Brian R. Y.; Li, Maximilian; Tang, Leonard
author_facet	Huang, Brian R. Y.; Li, Maximilian; Tang, Leonard
contents	Despite extensive safety measures, LLMs are vulnerable to adversarial inputs, or jailbreaks, which can elicit unsafe behaviors. In this work, we introduce bijection learning, a powerful attack algorithm which automatically fuzzes LLMs for safety vulnerabilities using randomly generated encodings whose complexity can be tightly controlled. We leverage in-context learning to teach models bijective encodings, pass encoded queries to the model to bypass built-in safety mechanisms, and finally decode responses back into English. Our attack is extremely effective on a wide range of frontier language models. Moreover, by controlling complexity parameters such as the number of key-value mappings in the encodings, we find a close relationship between the capability level of the attacked LLM and the average complexity of the most effective bijection attacks. Our work highlights that new vulnerabilities in frontier models can emerge with scale: more capable models are more severely jailbroken by bijection attacks.
format	Preprint
id	arxiv_https___arxiv_org_abs_2410_01294
institution	arXiv
publishDate	2024
record_format	arxiv
title	Endless Jailbreaks with Bijection Learning
topic	Computation and Language
url	https://arxiv.org/abs/2410.01294

Similar Items