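Staff View: PANDAS: Improving Many-shot Jailbreaking via Positive Affirmation, Negative Demonstration, and Adaptive Sampling :: Library Catalog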

Bibliographic Details
Main Authors:	Ma, Avery; Pan, Yangchen; Farahmand, Amir-massoud
Format:	Preprint
Published:	2025
Subjects:	Computation and Language; Cryptography and Security; Machine Learning
Online Access:	https://arxiv.org/abs/2502.01925

_version_	1866913890677293056
author	Ma, Avery; Pan, Yangchen; Farahmand, Amir-massoud
author_facet	Ma, Avery; Pan, Yangchen; Farahmand, Amir-massoud
contents	Many-shot jailbreaking circumvents the safety alignment of LLMs by exploiting their ability to process long input sequences. To achieve this, the malicious target prompt is prefixed with hundreds of fabricated conversational exchanges between the user and the model. These exchanges are randomly sampled from a pool of unsafe question-answer pairs, making it appear as though the model has already complied with harmful instructions. In this paper, we present PANDAS: a hybrid technique that improves many-shot jailbreaking by modifying these fabricated dialogues with Positive Affirmations, Negative Demonstrations, and an optimized Adaptive Sampling method tailored to the target prompt's topic. We also introduce ManyHarm, a dataset of harmful question-answer pairs, and demonstrate through extensive experiments that PANDAS significantly outperforms baseline methods in long-context scenarios. Through attention analysis, we provide insights into how long-context vulnerabilities are exploited and show how PANDAS further improves upon many-shot jailbreaking.
format	Preprint
id	arxiv_https___arxiv_org_abs_2502_01925
institution	arXiv
publishDate	2025
record_format	arxiv
title	PANDAS: Improving Many-shot Jailbreaking via Positive Affirmation, Negative Demonstration, and Adaptive Sampling
topic	Computation and Language; Cryptography and Security; Machine Learning
url	https://arxiv.org/abs/2502.01925

Similar Items