Bibliographic Details
Main Authors: Xhonneux, Sophie, Dobre, David, Tang, Jian, Gidel, Gauthier, Sridhar, Dhanya
Format: Preprint
Published: 2024
Subjects: Machine Learning, Cryptography and Security
Online Access: https://arxiv.org/abs/2402.05723
_version_ 1866910322938347520
author Xhonneux, Sophie
Dobre, David
Tang, Jian
Gidel, Gauthier
Sridhar, Dhanya
contents Despite significant investment into safety training, large language models (LLMs) deployed in the real world still suffer from numerous vulnerabilities. One perspective on LLM safety training is that it algorithmically forbids the model from answering toxic or harmful queries. To assess the effectiveness of safety training, in this work, we study forbidden tasks, i.e., tasks the model is designed to refuse to answer. Specifically, we investigate whether in-context learning (ICL) can be used to re-learn forbidden tasks despite the explicit fine-tuning of the model to refuse them. We first examine a toy example of refusing sentiment classification to demonstrate the problem. Then, we use ICL on a model fine-tuned to refuse to summarise made-up news articles. Finally, we investigate whether ICL can undo safety training, which could represent a major security risk. For the safety task, we look at Vicuna-7B, Starling-7B, and Llama2-7B. We show that the attack works out-of-the-box on Starling-7B and Vicuna-7B but fails on Llama2-7B. Finally, we propose an ICL attack that uses the chat template tokens like a prompt injection attack to achieve a better attack success rate on Vicuna-7B and Starling-7B. Trigger Warning: the appendix contains LLM-generated text with violence, suicide, and misinformation.
format Preprint
id arxiv_https___arxiv_org_abs_2402_05723
institution arXiv
publishDate 2024
record_format arxiv
title In-Context Learning Can Re-learn Forbidden Tasks
topic Machine Learning
Cryptography and Security
url https://arxiv.org/abs/2402.05723