Bibliographic Details
Main Authors: Kan, Chun Yan Ryan; Tran, Tommy; Yadav, Vedant; Cai, Ava; Zhu, Kevin; Li, Ruizhe; Chaudhary, Maheep
Format: Preprint
Published: 2026
Subjects:
Online Access: https://arxiv.org/abs/2602.18782
_version_ 1866914342109184000
author Kan, Chun Yan Ryan
Tran, Tommy
Yadav, Vedant
Cai, Ava
Zhu, Kevin
Li, Ruizhe
Chaudhary, Maheep
author_facet Kan, Chun Yan Ryan
Tran, Tommy
Yadav, Vedant
Cai, Ava
Zhu, Kevin
Li, Ruizhe
Chaudhary, Maheep
contents Defending LLMs against adversarial jailbreak attacks remains an open challenge. Existing defenses rely on binary classifiers that fail when adversarial input falls outside the learned decision boundary, and repeated fine-tuning is computationally expensive while potentially degrading model capabilities. We propose MANATEE, an inference-time defense that uses density estimation over a benign representation manifold. MANATEE learns the score function of benign hidden states and uses diffusion to project anomalous representations toward safe regions, requiring no harmful training data and no architectural modifications. Experiments across Mistral-7B-Instruct, Llama-3.1-8B-Instruct, and Gemma-2-9B-it demonstrate that MANATEE reduces Attack Success Rate by up to 100% on certain datasets, while preserving model utility on benign inputs.
format Preprint
id arxiv_https___arxiv_org_abs_2602_18782
institution arXiv
publishDate 2026
record_format arxiv
spellingShingle MANATEE: Inference-Time Lightweight Diffusion Based Safety Defense for LLMs
Kan, Chun Yan Ryan
Tran, Tommy
Yadav, Vedant
Cai, Ava
Zhu, Kevin
Li, Ruizhe
Chaudhary, Maheep
Cryptography and Security
Artificial Intelligence
Computation and Language
Machine Learning
Defending LLMs against adversarial jailbreak attacks remains an open challenge. Existing defenses rely on binary classifiers that fail when adversarial input falls outside the learned decision boundary, and repeated fine-tuning is computationally expensive while potentially degrading model capabilities. We propose MANATEE, an inference-time defense that uses density estimation over a benign representation manifold. MANATEE learns the score function of benign hidden states and uses diffusion to project anomalous representations toward safe regions, requiring no harmful training data and no architectural modifications. Experiments across Mistral-7B-Instruct, Llama-3.1-8B-Instruct, and Gemma-2-9B-it demonstrate that MANATEE reduces Attack Success Rate by up to 100% on certain datasets, while preserving model utility on benign inputs.
title MANATEE: Inference-Time Lightweight Diffusion Based Safety Defense for LLMs
topic Cryptography and Security
Artificial Intelligence
Computation and Language
Machine Learning
url https://arxiv.org/abs/2602.18782