Staff View: LLMScan: Causal Scan for LLM Misbehavior Detection :: Library Catalog

Bibliographic Details
Main Authors:	Zhang, Mengdi; Goh, Kai Kiat; Zhang, Peixin; Sun, Jun; Xin, Rose Lin; Zhang, Hongyu
Format:	Preprint
Published:	2024
Subjects:	Artificial Intelligence; Computation and Language; Machine Learning
Online Access:	https://arxiv.org/abs/2410.16638

_version_	1866918032662593536
author	Zhang, Mengdi; Goh, Kai Kiat; Zhang, Peixin; Sun, Jun; Xin, Rose Lin; Zhang, Hongyu
author_facet	Zhang, Mengdi; Goh, Kai Kiat; Zhang, Peixin; Sun, Jun; Xin, Rose Lin; Zhang, Hongyu
contents	Despite the success of Large Language Models (LLMs) across various fields, their potential to generate untruthful, biased and harmful responses poses significant risks, particularly in critical applications. This highlights the urgent need for systematic methods to detect and prevent such misbehavior. While existing approaches target specific issues such as harmful responses, this work introduces LLMScan, an innovative LLM monitoring technique based on causality analysis, offering a comprehensive solution. LLMScan systematically monitors the inner workings of an LLM through the lens of causal inference, operating on the premise that the LLM's 'brain' behaves differently when misbehaving. By analyzing the causal contributions of the LLM's input tokens and transformer layers, LLMScan effectively detects misbehavior. Extensive experiments across various tasks and models reveal clear distinctions in the causal distributions between normal behavior and misbehavior, enabling the development of accurate, lightweight detectors for a variety of misbehavior detection tasks.
format	Preprint
id	arxiv_https___arxiv_org_abs_2410_16638
institution	arXiv
publishDate	2024
record_format	arxiv
title	LLMScan: Causal Scan for LLM Misbehavior Detection
topic	Artificial Intelligence; Computation and Language; Machine Learning
url	https://arxiv.org/abs/2410.16638