Bibliographic Details
Main Authors: Assogba, Yannick, Cortellazzi, Jacopo, Abad, Javier, Rodriguez, Pau, Suau, Xavier, Blaas, Arno
Format: Preprint
Published: 2026
Subjects: Cryptography and Security, Computation and Language, Machine Learning
Online Access: https://arxiv.org/abs/2602.12418
_version_ 1866915796051034112
author Assogba, Yannick
Cortellazzi, Jacopo
Abad, Javier
Rodriguez, Pau
Suau, Xavier
Blaas, Arno
contents Jailbreak attacks remain a persistent threat to large language model safety. We propose Context-Conditioned Delta Steering (CC-Delta), a sparse autoencoder (SAE)-based defense that identifies jailbreak-relevant sparse features by comparing token-level representations of the same harmful request with and without jailbreak context. Using paired harmful/jailbreak prompts, CC-Delta selects features via statistical testing and applies inference-time mean-shift steering in SAE latent space. Across four aligned instruction-tuned models and twelve jailbreak attacks, CC-Delta achieves comparable or better safety-utility tradeoffs than baseline defenses operating in dense latent space. Our method clearly outperforms dense mean-shift steering on all four models, especially against out-of-distribution attacks, which shows that steering in sparse SAE feature space offers advantages over steering in dense activation space for jailbreak mitigation. Our results suggest that off-the-shelf SAEs trained for interpretability can be repurposed as practical jailbreak defenses without task-specific training.
format Preprint
id arxiv_https___arxiv_org_abs_2602_12418
institution arXiv
publishDate 2026
record_format arxiv
title Sparse Autoencoders are Capable LLM Jailbreak Mitigators
topic Cryptography and Security
Computation and Language
Machine Learning
url https://arxiv.org/abs/2602.12418
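
The abstract describes selecting SAE features by statistical testing on paired harmful/jailbreak prompts and then applying mean-shift steering in SAE latent space. The following is a minimal sketch of that general idea, not the authors' implementation. Everything in it is an assumption made for illustration: the function names `select_features` and `steer`, the paired t-statistic as the test, the threshold `t_threshold`, the steering strength `alpha`, and the synthetic arrays standing in for real SAE activations.

```python
import numpy as np


def select_features(z_plain, z_jailbreak, t_threshold=6.0):
    """Hypothetical selection step: paired t-statistic per SAE feature.

    z_plain, z_jailbreak: arrays of shape (n_pairs, n_features) holding SAE
    latents for the same harmful request without / with jailbreak context.
    Returns the indices of features whose |t| exceeds the threshold, plus the
    per-feature mean difference (jailbreak minus plain).
    """
    diff = z_jailbreak - z_plain
    n_pairs = diff.shape[0]
    mean_delta = diff.mean(axis=0)
    std = diff.std(axis=0, ddof=1)
    # Small epsilon guards against division by zero for constant features.
    t_stat = mean_delta / (std / np.sqrt(n_pairs) + 1e-12)
    selected = np.flatnonzero(np.abs(t_stat) > t_threshold)
    return selected, mean_delta


def steer(z, selected, mean_delta, alpha=1.0):
    """Hypothetical mean-shift step: move selected features back by alpha * delta."""
    z_steered = z.copy()
    z_steered[..., selected] -= alpha * mean_delta[selected]
    return z_steered


# Synthetic stand-in data (assumption): 200 prompt pairs, 16 SAE features,
# where jailbreak context shifts features 3 and 7 upward.
rng = np.random.default_rng(0)
z_plain = np.abs(rng.normal(size=(200, 16)))
z_jailbreak = z_plain + 0.1 * rng.normal(size=(200, 16))
z_jailbreak[:, [3, 7]] += 2.0

selected, mean_delta = select_features(z_plain, z_jailbreak)
z_steered = steer(z_jailbreak, selected, mean_delta)
```

In this toy setup only the selected features are modified, so the remaining SAE latents pass through unchanged. In a real pipeline the steered latents would still have to be decoded back into the model's activation space, a step this sketch omits.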