Table of Contents: :: Library Catalog

Bibliographic Details
Main Authors:	Kan, Chun Yan Ryan; Tran, Tommy; Yadav, Vedant; Cai, Ava; Zhu, Kevin; Li, Ruizhe; Chaudhary, Maheep
Format:	Preprint
Published:	2026
Subjects:	Cryptography and Security; Artificial Intelligence; Computation and Language; Machine Learning
Online Access:	https://arxiv.org/abs/2602.18782

Table of Contents:

Defending LLMs against adversarial jailbreak attacks remains an open challenge. Existing defenses rely on binary classifiers, which fail when adversarial input falls outside the learned decision boundary, and repeated fine-tuning is computationally expensive and can degrade model capabilities. We propose MANATEE, an inference-time defense that uses density estimation over a benign representation manifold. MANATEE learns the score function of benign hidden states and uses diffusion to project anomalous representations toward safe regions, requiring no harmful training data and no architectural modifications. Experiments across Mistral-7B-Instruct, Llama-3.1-8B-Instruct, and Gemma-2-9B-it demonstrate that MANATEE reduces Attack Success Rate by up to 100% on certain datasets while preserving model utility on benign inputs.
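
The idea of following a score function toward a high-density benign region can be illustrated with a toy sketch. This is not the MANATEE implementation: it assumes a single Gaussian fitted to synthetic "benign" vectors in place of a learned score network over LLM hidden states, and it uses plain noise-free score ascent in place of a diffusion process. All names (`fit_gaussian_score`, `project_toward_benign`, `mahalanobis`) and parameters (`ridge`, `step`, `n_steps`) are hypothetical.

```python
import numpy as np

def fit_gaussian_score(benign, ridge=1e-3):
    """Fit a Gaussian to benign rows and return its score, grad_x log p(x).

    Assumption: a Gaussian stands in for the learned score model.
    """
    mu = benign.mean(axis=0)
    cov = np.cov(benign, rowvar=False) + ridge * np.eye(benign.shape[1])
    precision = np.linalg.inv(cov)

    def score(x):
        # For a Gaussian, grad_x log p(x) = -Sigma^{-1} (x - mu).
        return -precision @ (x - mu)

    return score, mu, precision

def project_toward_benign(x, score, step=0.1, n_steps=50):
    """Move x along the score direction (noise-free score ascent)."""
    x = np.asarray(x, dtype=float).copy()
    for _ in range(n_steps):
        x = x + step * score(x)
    return x

def mahalanobis(x, mu, precision):
    """Distance of x from the benign mean under the fitted Gaussian."""
    d = x - mu
    return float(np.sqrt(d @ precision @ d))

# Synthetic stand-ins for hidden states (not real model activations).
rng = np.random.default_rng(0)
benign = rng.normal(size=(500, 4))
score, mu, precision = fit_gaussian_score(benign)

anomalous = np.full(4, 6.0)  # a point far from the benign cluster
projected = project_toward_benign(anomalous, score)

print("distance before:", mahalanobis(anomalous, mu, precision))
print("distance after: ", mahalanobis(projected, mu, precision))
```

In this toy setting the projected point ends up closer to the benign cluster than the starting point. Only benign samples are used to fit the density, which mirrors the abstract's claim that no harmful training data is required.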

Similar Items