Internal format: Sparse Autoencoders are Capable LLM Jailbreak Mitigators :: Library Catalog

Bibliographic Details
Main Authors:	Assogba, Yannick; Cortellazzi, Jacopo; Abad, Javier; Rodriguez, Pau; Suau, Xavier; Blaas, Arno
Format:	Preprint
Published:	2026
Subjects:	Cryptography and Security; Computation and Language; Machine Learning
Online Access:	https://arxiv.org/abs/2602.12418
_version_	1866915796051034112
author	Assogba, Yannick; Cortellazzi, Jacopo; Abad, Javier; Rodriguez, Pau; Suau, Xavier; Blaas, Arno
author_facet	Assogba, Yannick; Cortellazzi, Jacopo; Abad, Javier; Rodriguez, Pau; Suau, Xavier; Blaas, Arno
contents	Jailbreak attacks remain a persistent threat to large language model safety. We propose Context-Conditioned Delta Steering (CC-Delta), an SAE-based defense that identifies jailbreak-relevant sparse features by comparing token-level representations of the same harmful request with and without jailbreak context. Using paired harmful/jailbreak prompts, CC-Delta selects features via statistical testing and applies inference-time mean-shift steering in SAE latent space. Across four aligned instruction-tuned models and twelve jailbreak attacks, CC-Delta achieves comparable or better safety-utility tradeoffs than baseline defenses operating in dense latent space. In particular, our method clearly outperforms dense mean-shift steering on all four models, especially against out-of-distribution attacks, showing that steering in sparse SAE feature space offers advantages over steering in dense activation space for jailbreak mitigation. Our results suggest that off-the-shelf SAEs trained for interpretability can be repurposed as practical jailbreak defenses without task-specific training.
format	Preprint
id	arxiv_https___arxiv_org_abs_2602_12418
institution	arXiv
publishDate	2026
record_format	arxiv
title	Sparse Autoencoders are Capable LLM Jailbreak Mitigators
topic	Cryptography and Security; Computation and Language; Machine Learning
url	https://arxiv.org/abs/2602.12418
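
The abstract in the contents field describes two steps: selecting sparse features by statistical testing on paired prompts, then applying a mean-shift in SAE latent space at inference time. The following is only a minimal sketch of that general idea under stated assumptions, not the authors' implementation. The record does not say which statistical test is used, so a paired t-statistic with a fixed threshold stands in for it here. The function names (`select_features`, `steer`), the threshold, the `alpha` scale, and the choice to map the latent shift back through a decoder matrix `W_dec` are all hypothetical.

```python
import numpy as np


def select_features(z_plain, z_jailbreak, t_threshold=4.0):
    """Pick SAE features whose activation shifts when jailbreak context is added.

    z_plain, z_jailbreak: arrays of shape (n_pairs, n_features) holding SAE
    latents for the same harmful request without / with jailbreak context.
    ASSUMPTION: a paired t-statistic is used as the test; the abstract only
    says "statistical testing".
    """
    delta = z_jailbreak - z_plain                # per-pair activation change
    n_pairs = delta.shape[0]
    mean_delta = delta.mean(axis=0)
    std_delta = delta.std(axis=0, ddof=1)
    # t = mean / (std / sqrt(n)); features with zero variance get t = 0.
    t_stat = np.divide(
        mean_delta * np.sqrt(n_pairs),
        std_delta,
        out=np.zeros_like(mean_delta),
        where=std_delta > 0,
    )
    selected = np.abs(t_stat) > t_threshold      # boolean mask over features
    return selected, mean_delta


def steer(h, W_dec, selected, mean_delta, alpha=1.0):
    """Mean-shift steering restricted to the selected SAE features.

    h: dense activation(s), shape (d_model,) or (batch, d_model).
    W_dec: hypothetical SAE decoder matrix, shape (n_features, d_model).
    The shift subtracts the mean jailbreak-induced change on the selected
    features and maps it back to activation space through the decoder.
    """
    latent_shift = -alpha * mean_delta * selected
    return h + latent_shift @ W_dec
```

Restricting the shift to the selected features leaves every other SAE direction untouched, which is one plausible reading of why steering in a sparse feature space might disturb normal behavior less than a dense mean-shift.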