Bibliographic Details
Main Authors: Ma, Avery; Pan, Yangchen; Farahmand, Amir-massoud
Format: Preprint
Published: 2025
Subjects: Computation and Language; Cryptography and Security; Machine Learning
Online Access: https://arxiv.org/abs/2502.01925
author Ma, Avery
Pan, Yangchen
Farahmand, Amir-massoud
contents Many-shot jailbreaking circumvents the safety alignment of LLMs by exploiting their ability to process long input sequences. To achieve this, the malicious target prompt is prefixed with hundreds of fabricated conversational exchanges between the user and the model. These exchanges are randomly sampled from a pool of unsafe question-answer pairs, making it appear as though the model has already complied with harmful instructions. In this paper, we present PANDAS: a hybrid technique that improves many-shot jailbreaking by modifying these fabricated dialogues with Positive Affirmations, Negative Demonstrations, and an optimized Adaptive Sampling method tailored to the target prompt's topic. We also introduce ManyHarm, a dataset of harmful question-answer pairs, and demonstrate through extensive experiments that PANDAS significantly outperforms baseline methods in long-context scenarios. Through attention analysis, we provide insights into how long-context vulnerabilities are exploited and show how PANDAS further improves upon many-shot jailbreaking.
format Preprint
id arxiv_https___arxiv_org_abs_2502_01925
institution arXiv
publishDate 2025
record_format arxiv
title PANDAS: Improving Many-shot Jailbreaking via Positive Affirmation, Negative Demonstration, and Adaptive Sampling
topic Computation and Language
Cryptography and Security
Machine Learning
url https://arxiv.org/abs/2502.01925