MARC21: :: Library Catalog

Bibliographic Details
Main Authors:	Kan, Chun Yan Ryan; Tran, Tommy; Yadav, Vedant; Cai, Ava; Zhu, Kevin; Li, Ruizhe; Chaudhary, Maheep
Format:	Preprint
Published:	2026
Subjects:	Cryptography and Security; Artificial Intelligence; Computation and Language; Machine Learning
Online Access:	https://arxiv.org/abs/2602.18782

_version_	1866914342109184000
author	Kan, Chun Yan Ryan; Tran, Tommy; Yadav, Vedant; Cai, Ava; Zhu, Kevin; Li, Ruizhe; Chaudhary, Maheep
author_facet	Kan, Chun Yan Ryan; Tran, Tommy; Yadav, Vedant; Cai, Ava; Zhu, Kevin; Li, Ruizhe; Chaudhary, Maheep
contents	Defending LLMs against adversarial jailbreak attacks remains an open challenge. Existing defenses rely on binary classifiers that fail when adversarial input falls outside the learned decision boundary, and repeated fine-tuning is computationally expensive while potentially degrading model capabilities. We propose MANATEE, an inference-time defense that uses density estimation over a benign representation manifold. MANATEE learns the score function of benign hidden states and uses diffusion to project anomalous representations toward safe regions, requiring no harmful training data and no architectural modifications. Experiments across Mistral-7B-Instruct, Llama-3.1-8B-Instruct, and Gemma-2-9B-it demonstrate that MANATEE reduces Attack Success Rate by up to 100% on certain datasets while preserving model utility on benign inputs.
format	Preprint
id	arxiv_https___arxiv_org_abs_2602_18782
institution	arXiv
publishDate	2026
record_format	arxiv
title	MANATEE: Inference-Time Lightweight Diffusion Based Safety Defense for LLMs
topic	Cryptography and Security; Artificial Intelligence; Computation and Language; Machine Learning
url	https://arxiv.org/abs/2602.18782
