Staff View: Beyond I'm Sorry, I Can't: Dissecting Large Language Model Refusal :: Library Catalog

Bibliographic Details
Main Authors:	Prakash, Nirmalendu; Jie, Yeo Wei; Abdullah, Amir; Satapathy, Ranjan; Cambria, Erik; Lee, Roy Ka Wei
Format:	Preprint
Published:	2025
Subjects:	Computation and Language; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2509.09708

_version_	1866915961578192896
author	Prakash, Nirmalendu; Jie, Yeo Wei; Abdullah, Amir; Satapathy, Ranjan; Cambria, Erik; Lee, Roy Ka Wei
author_facet	Prakash, Nirmalendu; Jie, Yeo Wei; Abdullah, Amir; Satapathy, Ranjan; Cambria, Erik; Lee, Roy Ka Wei
contents	Refusal on harmful prompts is a key safety behaviour in instruction-tuned large language models (LLMs), yet the internal causes of this behaviour remain poorly understood. We study two public instruction-tuned models, Gemma-2-2B-IT and LLaMA-3.1-8B-IT, using sparse autoencoders (SAEs) trained on residual-stream activations. Given a harmful prompt, we search the SAE latent space for feature sets whose ablation flips the model from refusal to compliance, demonstrating causal influence and creating a jailbreak. Our search proceeds in three stages: (1) Refusal Direction: find a refusal-mediating direction and collect SAE features near that direction; (2) Greedy Filtering: prune to a minimal set; and (3) Interaction Discovery: fit a factorization machine (FM) that captures nonlinear interactions among the remaining active features and the minimal set. This pipeline yields a broad set of jailbreak-critical features, offering insight into the mechanistic basis of refusal. Moreover, we find evidence of redundant features that remain dormant unless earlier features are suppressed. Our findings highlight the potential for fine-grained auditing and targeted intervention in safety behaviours by manipulating the interpretable latent space.
format	Preprint
id	arxiv_https___arxiv_org_abs_2509_09708
institution	arXiv
publishDate	2025
record_format	arxiv
title	Beyond I'm Sorry, I Can't: Dissecting Large Language Model Refusal
topic	Computation and Language; Artificial Intelligence
url	https://arxiv.org/abs/2509.09708

Similar Items