:: Library Catalog

Bibliographic Details
Main Authors:	Assogba, Yannick; Cortellazzi, Jacopo; Abad, Javier; Rodriguez, Pau; Suau, Xavier; Blaas, Arno
Format:	Preprint
Published:	2026
Subjects:	Cryptography and Security; Computation and Language; Machine Learning
Online Access:	https://arxiv.org/abs/2602.12418

Similar Items

SAEs $\textit{Can}$ Improve Unlearning: Dynamic Sparse Autoencoder Guardrails for Precision Unlearning in LLMs
by: Muhamed, Aashiq, et al.
Published: (2025)

AutoDefense: Multi-Agent LLM Defense against Jailbreak Attacks
by: Zeng, Yifan, et al.
Published: (2024)

Copyright-Protected Language Generation via Adaptive Model Fusion
by: Abad, Javier, et al.
Published: (2024)

Towards Understanding the Robustness of Sparse Autoencoders
by: Saiyed, Ahson, et al.
Published: (2026)

LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet
by: Li, Nathaniel, et al.
Published: (2024)

AdvPrefix: An Objective for Nuanced LLM Jailbreaks
by: Zhu, Sicheng, et al.
Published: (2024)

Universal Jailbreak Backdoors from Poisoned Human Feedback
by: Rando, Javier, et al.
Published: (2023)

A StrongREJECT for Empty Jailbreaks
by: Souly, Alexandra, et al.
Published: (2024)

Sockpuppetting: Jailbreaking LLMs by Combining Prefilling with Optimization
by: Dotsinski, Asen, et al.
Published: (2026)

Testing the Limits of Jailbreaking Defenses with the Purple Problem
by: Kim, Taeyoun, et al.
Published: (2024)

Improved Techniques for Optimization-Based Jailbreaking on Large Language Models
by: Jia, Xiaojun, et al.
Published: (2024)

VERA: Variational Inference Framework for Jailbreaking Large Language Models
by: Lochab, Anamika, et al.
Published: (2025)

Intriguing Properties of Adversarial ML Attacks in the Problem Space [Extended Version]
by: Cortellazzi, Jacopo, et al.
Published: (2019)

On the Effectiveness of Adversarial Training on Malware Classifiers
by: Bostani, Hamid, et al.
Published: (2024)

PromptScreen: Efficient Jailbreak Mitigation Using Semantic Linear Classification in a Multi-Staged Pipeline
by: Rao, Akshaj Prashanth, et al.
Published: (2025)

Jailbreaking in the Haystack
by: Shah, Rishi Rajesh, et al.
Published: (2025)

Deciphering the Chaos: Enhancing Jailbreak Attacks via Adversarial Prompt Translation
by: Li, Qizhang, et al.
Published: (2024)

Steering Dialogue Dynamics for Robustness against Multi-turn Jailbreaking Attacks
by: Hu, Hanjiang, et al.
Published: (2025)

LARGO: Latent Adversarial Reflection through Gradient Optimization for Jailbreaking LLMs
by: Li, Ran, et al.
Published: (2025)

STAC: When Innocent Tools Form Dangerous Chains to Jailbreak LLM Agents
by: Li, Jing-Jing, et al.
Published: (2025)

JailbreakRadar: Comprehensive Assessment of Jailbreak Attacks Against LLMs
by: Chu, Junjie, et al.
Published: (2024)

Competition Report: Finding Universal Jailbreak Backdoors in Aligned LLMs
by: Rando, Javier, et al.
Published: (2024)

Break the Breakout: Reinventing LM Defense Against Jailbreak Attacks with Self-Refinement
by: Kim, Heegyu, et al.
Published: (2024)

From Theft to Bomb-Making: The Ripple Effect of Unlearning in Defending Against Jailbreak Attacks
by: Zhang, Zhexin, et al.
Published: (2024)

RED QUEEN: Safeguarding Large Language Models against Concealed Multi-Turn Jailbreaking
by: Jiang, Yifan, et al.
Published: (2024)

Jailbreak-Zero: A Path to Pareto Optimal Red Teaming for Large Language Models
by: Hu, Kai, et al.
Published: (2025)

Sparse Tokens Suffice: Jailbreaking Audio Language Models via Token-Aware Gradient Optimization
by: Fang, Zheng, et al.
Published: (2026)

Advancing Jailbreak Strategies: A Hybrid Approach to Exploiting LLM Vulnerabilities and Bypassing Modern Defenses
by: Ahmed, Mohamed, et al.
Published: (2025)

PANDAS: Improving Many-shot Jailbreaking via Positive Affirmation, Negative Demonstration, and Adaptive Sampling
by: Ma, Avery, et al.
Published: (2025)

Con Instruction: Universal Jailbreaking of Multimodal Large Language Models via Non-Textual Modalities
by: Geng, Jiahui, et al.
Published: (2025)

PrisonBreak: Jailbreaking Large Language Models with at Most Twenty-Five Targeted Bit-flips
by: Coalson, Zachary, et al.
Published: (2024)

CodeCloak: A Method for Evaluating and Mitigating Code Leakage by LLM Code Assistants
by: Noah, Amit Finkman, et al.
Published: (2024)

Jailbreaking LLMs via Calibration
by: Lu, Yuxuan, et al.
Published: (2026)

Jailbreaking with Universal Multi-Prompts
by: Hsu, Yu-Ling, et al.
Published: (2025)

TwinGate: Stateful Defense against Decompositional Jailbreaks in Untraceable Traffic via Asymmetric Contrastive Learning
by: Sun, Bowen, et al.
Published: (2026)

Capability-Based Scaling Trends for LLM-Based Red-Teaming
by: Panfilov, Alexander, et al.
Published: (2025)

A Systematic Literature Review on LLM Defenses Against Prompt Injection and Jailbreaking: Expanding NIST Taxonomy
by: Correia, Pedro H. Barcha, et al.
Published: (2026)

TurboFuzzLLM: Turbocharging Mutation-based Fuzzing for Effectively Jailbreaking Large Language Models in Practice
by: Goel, Aman, et al.
Published: (2025)

Red Teaming GPT-4V: Are GPT-4V Safe Against Uni/Multi-Modal Jailbreak Attacks?
by: Chen, Shuo, et al.
Published: (2024)

There Are No Silly Questions: Evaluation of Offline LLM Capabilities from a Turkish Perspective
by: Yilmaz, Edibe, et al.
Published: (2026)