Bibliographic Details
Main Authors: Kan, Chun Yan Ryan, Tran, Tommy, Yadav, Vedant, Cai, Ava, Zhu, Kevin, Li, Ruizhe, Chaudhary, Maheep
Format: Preprint
Published: 2026
Subjects:
Online Access: https://arxiv.org/abs/2602.18782
Table of Contents:
  • Defending LLMs against adversarial jailbreak attacks remains an open challenge. Existing defenses rely on binary classifiers that fail when adversarial input falls outside the learned decision boundary, and repeated fine-tuning is computationally expensive and can degrade model capabilities. We propose MANATEE, an inference-time defense that uses density estimation over a benign representation manifold. MANATEE learns the score function of benign hidden states and uses diffusion to project anomalous representations toward safe regions, requiring no harmful training data and no architectural modifications. Experiments across Mistral-7B-Instruct, Llama-3.1-8B-Instruct, and Gemma-2-9B-it demonstrate that MANATEE reduces Attack Success Rate by up to 100% on certain datasets while preserving model utility on benign inputs.
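
The idea of following a score function back toward a benign region can be illustrated with a toy sketch. This is not the MANATEE implementation: it assumes a single Gaussian fitted to synthetic "benign" vectors in place of a learned score network over real hidden states, and a noise-free gradient ascent in place of a diffusion process. All names here (`fit_gaussian`, `score`, `project`, the threshold of 16) are hypothetical and chosen for illustration only.

```python
import numpy as np

def fit_gaussian(benign):
    """Estimate the mean and a regularized precision matrix of benign vectors."""
    mu = benign.mean(axis=0)
    cov = np.cov(benign, rowvar=False) + 1e-3 * np.eye(benign.shape[1])
    return mu, np.linalg.inv(cov)

def score(x, mu, precision):
    """Score of a Gaussian density: grad_x log p(x) = -precision @ (x - mu)."""
    return -precision @ (x - mu)

def mahalanobis_sq(x, mu, precision):
    """Squared Mahalanobis distance, used here as a stand-in anomaly measure."""
    d = x - mu
    return float(d @ precision @ d)

def project(x, mu, precision, threshold, step=0.1, max_steps=200):
    """Move x along the score until it falls inside the benign region."""
    x = x.copy()
    for _ in range(max_steps):
        if mahalanobis_sq(x, mu, precision) <= threshold:
            break  # already benign-looking: leave it alone
        x = x + step * score(x, mu, precision)
    return x

# Synthetic stand-ins for hidden states (assumption: 8-dimensional vectors).
rng = np.random.default_rng(0)
benign = rng.normal(size=(500, 8))
mu, precision = fit_gaussian(benign)

threshold = 16.0                 # hypothetical cutoff for "inside the manifold"
anomalous = np.full(8, 6.0)      # a point far from the benign samples
projected = project(anomalous, mu, precision, threshold)
```

In this sketch, points that already lie within the threshold are returned unchanged, which mirrors the abstract's claim that benign inputs are left alone, while distant points are pulled toward the high-density region. No harmful examples are needed to fit the density.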