Bibliographic Details
Main Authors: Zhang, Mengdi; Goh, Kai Kiat; Zhang, Peixin; Sun, Jun; Xin, Rose Lin; Zhang, Hongyu
Format: Preprint
Published: 2024
Online Access: https://arxiv.org/abs/2410.16638
author Zhang, Mengdi
Goh, Kai Kiat
Zhang, Peixin
Sun, Jun
Xin, Rose Lin
Zhang, Hongyu
contents Despite the success of Large Language Models (LLMs) across various fields, their potential to generate untruthful, biased and harmful responses poses significant risks, particularly in critical applications. This highlights the urgent need for systematic methods to detect and prevent such misbehavior. While existing approaches target specific issues such as harmful responses, this work introduces LLMScan, an innovative LLM monitoring technique based on causality analysis, offering a comprehensive solution. LLMScan systematically monitors the inner workings of an LLM through the lens of causal inference, operating on the premise that the LLM's 'brain' behaves differently when misbehaving. By analyzing the causal contributions of the LLM's input tokens and transformer layers, LLMScan effectively detects misbehavior. Extensive experiments across various tasks and models reveal clear distinctions in the causal distributions between normal behavior and misbehavior, enabling the development of accurate, lightweight detectors for a variety of misbehavior detection tasks.
format Preprint
id arxiv_https___arxiv_org_abs_2410_16638
institution arXiv
publishDate 2024
record_format arxiv
title LLMScan: Causal Scan for LLM Misbehavior Detection
topic Artificial Intelligence
Computation and Language
Machine Learning
url https://arxiv.org/abs/2410.16638