Bibliographic Details
Main Authors: Kawasaki, Amelia, Davis, Andrew, Abbas, Houssam
Format: Preprint
Published: 2024
Subjects: Cryptography and Security, Machine Learning
Online Access: https://arxiv.org/abs/2406.03230
author Kawasaki, Amelia
Davis, Andrew
Abbas, Houssam
contents The widespread adoption of Large Language Models (LLMs), exemplified by OpenAI's ChatGPT, makes defending these models against adversarial attacks a pressing need. Such attacks manipulate an LLM's output by introducing malicious inputs, undermining the model's integrity and the trust users place in its outputs. In response, our paper presents a defensive strategy that, given white-box access to an LLM, analyzes residual activations between the model's transformer layers. We apply a novel methodology that classifies attack prompts from distinctive activation patterns in the residual streams. We curate multiple datasets to demonstrate that this classification method achieves high accuracy across multiple types of attack scenarios, including our newly created attack dataset. Furthermore, we apply safety fine-tuning to the LLM to improve its resilience and measure the effect of that fine-tuning on our ability to detect attacks. The results underscore the effectiveness of our approach in improving the detection and mitigation of adversarial inputs, advancing the security framework within which LLMs operate.
format Preprint
id arxiv_https___arxiv_org_abs_2406_03230
institution arXiv
publishDate 2024
record_format arxiv
title Defending Large Language Models Against Attacks With Residual Stream Activation Analysis
topic Cryptography and Security
Machine Learning
url https://arxiv.org/abs/2406.03230