Staff View: Defending Large Language Models Against Attacks With Residual Stream Activation Analysis :: Library Catalog

Bibliographic Details
Main Authors:	Kawasaki, Amelia; Davis, Andrew; Abbas, Houssam
Format:	Preprint
Published:	2024
Subjects:	Cryptography and Security; Machine Learning
Online Access:	https://arxiv.org/abs/2406.03230

_version_	1866910902169632768
author	Kawasaki, Amelia; Davis, Andrew; Abbas, Houssam
author_facet	Kawasaki, Amelia; Davis, Andrew; Abbas, Houssam
contents	The widespread adoption of Large Language Models (LLMs), exemplified by OpenAI's ChatGPT, brings to the forefront the imperative to defend against adversarial threats on these models. These attacks, which manipulate an LLM's output by introducing malicious inputs, undermine the model's integrity and the trust users place in its outputs. In response to this challenge, our paper presents an innovative defensive strategy, given white box access to an LLM, that harnesses residual activation analysis between transformer layers of the LLM. We apply a novel methodology for analyzing distinctive activation patterns in the residual streams for attack prompt classification. We curate multiple datasets to demonstrate how this method of classification has high accuracy across multiple types of attack scenarios, including our newly-created attack dataset. Furthermore, we enhance the model's resilience by integrating safety fine-tuning techniques for LLMs in order to measure its effect on our capability to detect attacks. The results underscore the effectiveness of our approach in enhancing the detection and mitigation of adversarial inputs, advancing the security framework within which LLMs operate.
format	Preprint
id	arxiv_https___arxiv_org_abs_2406_03230
institution	arXiv
publishDate	2024
record_format	arxiv
title	Defending Large Language Models Against Attacks With Residual Stream Activation Analysis
topic	Cryptography and Security; Machine Learning
url	https://arxiv.org/abs/2406.03230

Similar Items