Gespeichert in:
| Hauptverfasser: | , , , , , |
|---|---|
| Format: | Preprint |
| Veröffentlicht: |
2026
|
| Schlagworte: | |
| Online-Zugang: | https://arxiv.org/abs/2602.12418 |
| Tags: |
Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
|
| _version_ | 1866915796051034112 |
|---|---|
| author | Assogba, Yannick Cortellazzi, Jacopo Abad, Javier Rodriguez, Pau Suau, Xavier Blaas, Arno |
| author_facet | Assogba, Yannick Cortellazzi, Jacopo Abad, Javier Rodriguez, Pau Suau, Xavier Blaas, Arno |
| contents | Jailbreak attacks remain a persistent threat to large language model safety. We propose Context-Conditioned Delta Steering (CC-Delta), an SAE-based defense that identifies jailbreak-relevant sparse features by comparing token-level representations of the same harmful request with and without jailbreak context. Using paired harmful/jailbreak prompts, CC-Delta selects features via statistical testing and applies inference-time mean-shift steering in SAE latent space. Across four aligned instruction-tuned models and twelve jailbreak attacks, CC-Delta achieves comparable or better safety-utility tradeoffs than baseline defenses operating in dense latent space. In particular, our method clearly outperforms dense mean-shift steering on all four models, and particularly against out-of-distribution attacks, showing that steering in sparse SAE feature space offers advantages over steering in dense activation space for jailbreak mitigation. Our results suggest off-the-shelf SAEs trained for interpretability can be repurposed as practical jailbreak defenses without task-specific training. |
| format | Preprint |
| id |
arxiv_https___arxiv_org_abs_2602_12418 |
| institution | arXiv |
| publishDate | 2026 |
| record_format | arxiv |
| spellingShingle | Sparse Autoencoders are Capable LLM Jailbreak Mitigators Assogba, Yannick Cortellazzi, Jacopo Abad, Javier Rodriguez, Pau Suau, Xavier Blaas, Arno Cryptography and Security Computation and Language Machine Learning Jailbreak attacks remain a persistent threat to large language model safety. We propose Context-Conditioned Delta Steering (CC-Delta), an SAE-based defense that identifies jailbreak-relevant sparse features by comparing token-level representations of the same harmful request with and without jailbreak context. Using paired harmful/jailbreak prompts, CC-Delta selects features via statistical testing and applies inference-time mean-shift steering in SAE latent space. Across four aligned instruction-tuned models and twelve jailbreak attacks, CC-Delta achieves comparable or better safety-utility tradeoffs than baseline defenses operating in dense latent space. In particular, our method clearly outperforms dense mean-shift steering on all four models, and particularly against out-of-distribution attacks, showing that steering in sparse SAE feature space offers advantages over steering in dense activation space for jailbreak mitigation. Our results suggest off-the-shelf SAEs trained for interpretability can be repurposed as practical jailbreak defenses without task-specific training. |
| title | Sparse Autoencoders are Capable LLM Jailbreak Mitigators |
| topic | Cryptography and Security Computation and Language Machine Learning |
| url | https://arxiv.org/abs/2602.12418 |