Staff View: SafeSwitch: Steering Unsafe LLM Behavior via Internal Activation Signals

Bibliographic Details
Main Authors:	Han, Peixuan; Qian, Cheng; Chen, Xiusi; Zhang, Yuji; Ji, Heng; Zhang, Denghui
Format:	Preprint
Published:	2025
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2502.01042

_version_	1866912586727948288
author	Han, Peixuan; Qian, Cheng; Chen, Xiusi; Zhang, Yuji; Ji, Heng; Zhang, Denghui
author_facet	Han, Peixuan; Qian, Cheng; Chen, Xiusi; Zhang, Yuji; Ji, Heng; Zhang, Denghui
contents	Large language models (LLMs) exhibit exceptional capabilities across various tasks but also pose risks by generating harmful content. Existing safety mechanisms, while improving model safety, often lead to overly cautious behavior and fail to fully leverage LLMs' internal cognitive processes. Inspired by humans' reflective thinking capability, we first show that LLMs can similarly assess safety in their internal states. Building on this insight, we propose SafeSwitch, a dynamic framework that regulates unsafe outputs: a prober-based internal state monitor actively detects harmful intentions, and the framework activates a safety head, which leads to safer and more conservative responses, only when necessary. SafeSwitch reduces harmful outputs by approximately 80% on harmful queries while maintaining strong utility, reaching a Pareto-optimal trade-off among several methods. Our method also has an advantage over traditional methods in offering more informative, context-aware refusals, and it achieves these benefits while tuning less than 6% of the original parameters. SafeSwitch demonstrates large language models' capacity for self-awareness and reflection regarding safety, offering a promising approach to more nuanced and effective safety controls. Code for this work is available at https://github.com/Hanpx20/SafeSwitch.
format	Preprint
id	arxiv_https___arxiv_org_abs_2502_01042
institution	arXiv
publishDate	2025
record_format	arxiv
title	SafeSwitch: Steering Unsafe LLM Behavior via Internal Activation Signals
topic	Machine Learning
url	https://arxiv.org/abs/2502.01042

Similar Items