| Main Authors: | Assogba, Yannick, Cortellazzi, Jacopo, Abad, Javier, Rodriguez, Pau, Suau, Xavier, Blaas, Arno |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2602.12418 |
Similar Items
SAEs $\textit{Can}$ Improve Unlearning: Dynamic Sparse Autoencoder Guardrails for Precision Unlearning in LLMs
by: Muhamed, Aashiq, et al.
Published: (2025)
AutoDefense: Multi-Agent LLM Defense against Jailbreak Attacks
by: Zeng, Yifan, et al.
Published: (2024)
Copyright-Protected Language Generation via Adaptive Model Fusion
by: Abad, Javier, et al.
Published: (2024)
Towards Understanding the Robustness of Sparse Autoencoders
by: Saiyed, Ahson, et al.
Published: (2026)
LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet
by: Li, Nathaniel, et al.
Published: (2024)
AdvPrefix: An Objective for Nuanced LLM Jailbreaks
by: Zhu, Sicheng, et al.
Published: (2024)
Universal Jailbreak Backdoors from Poisoned Human Feedback
by: Rando, Javier, et al.
Published: (2023)
A StrongREJECT for Empty Jailbreaks
by: Souly, Alexandra, et al.
Published: (2024)
Sockpuppetting: Jailbreaking LLMs by Combining Prefilling with Optimization
by: Dotsinski, Asen, et al.
Published: (2026)
Testing the Limits of Jailbreaking Defenses with the Purple Problem
by: Kim, Taeyoun, et al.
Published: (2024)
Improved Techniques for Optimization-Based Jailbreaking on Large Language Models
by: Jia, Xiaojun, et al.
Published: (2024)
VERA: Variational Inference Framework for Jailbreaking Large Language Models
by: Lochab, Anamika, et al.
Published: (2025)
Intriguing Properties of Adversarial ML Attacks in the Problem Space [Extended Version]
by: Cortellazzi, Jacopo, et al.
Published: (2019)
On the Effectiveness of Adversarial Training on Malware Classifiers
by: Bostani, Hamid, et al.
Published: (2024)
PromptScreen: Efficient Jailbreak Mitigation Using Semantic Linear Classification in a Multi-Staged Pipeline
by: Rao, Akshaj Prashanth, et al.
Published: (2025)
Jailbreaking in the Haystack
by: Shah, Rishi Rajesh, et al.
Published: (2025)
Deciphering the Chaos: Enhancing Jailbreak Attacks via Adversarial Prompt Translation
by: Li, Qizhang, et al.
Published: (2024)
Steering Dialogue Dynamics for Robustness against Multi-turn Jailbreaking Attacks
by: Hu, Hanjiang, et al.
Published: (2025)
LARGO: Latent Adversarial Reflection through Gradient Optimization for Jailbreaking LLMs
by: Li, Ran, et al.
Published: (2025)
STAC: When Innocent Tools Form Dangerous Chains to Jailbreak LLM Agents
by: Li, Jing-Jing, et al.
Published: (2025)
JailbreakRadar: Comprehensive Assessment of Jailbreak Attacks Against LLMs
by: Chu, Junjie, et al.
Published: (2024)
Competition Report: Finding Universal Jailbreak Backdoors in Aligned LLMs
by: Rando, Javier, et al.
Published: (2024)
Break the Breakout: Reinventing LM Defense Against Jailbreak Attacks with Self-Refinement
by: Kim, Heegyu, et al.
Published: (2024)
From Theft to Bomb-Making: The Ripple Effect of Unlearning in Defending Against Jailbreak Attacks
by: Zhang, Zhexin, et al.
Published: (2024)
RED QUEEN: Safeguarding Large Language Models against Concealed Multi-Turn Jailbreaking
by: Jiang, Yifan, et al.
Published: (2024)
Jailbreak-Zero: A Path to Pareto Optimal Red Teaming for Large Language Models
by: Hu, Kai, et al.
Published: (2025)
Sparse Tokens Suffice: Jailbreaking Audio Language Models via Token-Aware Gradient Optimization
by: Fang, Zheng, et al.
Published: (2026)
Advancing Jailbreak Strategies: A Hybrid Approach to Exploiting LLM Vulnerabilities and Bypassing Modern Defenses
by: Ahmed, Mohamed, et al.
Published: (2025)
PANDAS: Improving Many-shot Jailbreaking via Positive Affirmation, Negative Demonstration, and Adaptive Sampling
by: Ma, Avery, et al.
Published: (2025)
Con Instruction: Universal Jailbreaking of Multimodal Large Language Models via Non-Textual Modalities
by: Geng, Jiahui, et al.
Published: (2025)
PrisonBreak: Jailbreaking Large Language Models with at Most Twenty-Five Targeted Bit-flips
by: Coalson, Zachary, et al.
Published: (2024)
CodeCloak: A Method for Evaluating and Mitigating Code Leakage by LLM Code Assistants
by: Noah, Amit Finkman, et al.
Published: (2024)
Jailbreaking LLMs via Calibration
by: Lu, Yuxuan, et al.
Published: (2026)
Jailbreaking with Universal Multi-Prompts
by: Hsu, Yu-Ling, et al.
Published: (2025)
TwinGate: Stateful Defense against Decompositional Jailbreaks in Untraceable Traffic via Asymmetric Contrastive Learning
by: Sun, Bowen, et al.
Published: (2026)
Capability-Based Scaling Trends for LLM-Based Red-Teaming
by: Panfilov, Alexander, et al.
Published: (2025)
A Systematic Literature Review on LLM Defenses Against Prompt Injection and Jailbreaking: Expanding NIST Taxonomy
by: Correia, Pedro H. Barcha, et al.
Published: (2026)
TurboFuzzLLM: Turbocharging Mutation-based Fuzzing for Effectively Jailbreaking Large Language Models in Practice
by: Goel, Aman, et al.
Published: (2025)
Red Teaming GPT-4V: Are GPT-4V Safe Against Uni/Multi-Modal Jailbreak Attacks?
by: Chen, Shuo, et al.
Published: (2024)
There Are No Silly Questions: Evaluation of Offline LLM Capabilities from a Turkish Perspective
by: Yilmaz, Edibe, et al.
Published: (2026)