Staff View: :: Library Catalog

Saved in:

Bibliographic Details
Main Authors:	Wang, Han, Wang, Gang, Zhang, Huan
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition Artificial Intelligence
Online Access:	https://arxiv.org/abs/2411.16721
Tags:	Add Tag No Tags, Be the first to tag this record!

_version_	1866915269614501888
author	Wang, Han Wang, Gang Zhang, Huan
author_facet	Wang, Han Wang, Gang Zhang, Huan
contents	Vision Language Models (VLMs) can produce unintended and harmful content when exposed to adversarial attacks, particularly because their vision capabilities create new vulnerabilities. Existing defenses, such as input preprocessing, adversarial training, and response evaluation-based methods, are often impractical for real-world deployment due to their high costs. To address this challenge, we propose ASTRA, an efficient and effective defense by adaptively steering models away from adversarial feature directions to resist VLM attacks. Our key procedures involve finding transferable steering vectors representing the direction of harmful response and applying adaptive activation steering to remove these directions at inference time. To create effective steering vectors, we randomly ablate the visual tokens from the adversarial images and identify those most strongly associated with jailbreaks. These tokens are then used to construct steering vectors. During inference, we perform the adaptive steering method that involves the projection between the steering vectors and calibrated activation, resulting in little performance drops on benign inputs while strongly avoiding harmful outputs under adversarial inputs. Extensive experiments across multiple models and baselines demonstrate our state-of-the-art performance and high efficiency in mitigating jailbreak risks. Additionally, ASTRA exhibits good transferability, defending against unseen attacks (i.e., structured-based attack, perturbation-based attack with project gradient descent variants, and text-only attack). Our code is available at \url{https://github.com/ASTRAL-Group/ASTRA}.
format	Preprint
id	arxiv_https___arxiv_org_abs_2411_16721
institution	arXiv
publishDate	2024
record_format	arxiv
spellingShingle	Steering Away from Harm: An Adaptive Approach to Defending Vision Language Model Against Jailbreaks Wang, Han Wang, Gang Zhang, Huan Computer Vision and Pattern Recognition Artificial Intelligence Vision Language Models (VLMs) can produce unintended and harmful content when exposed to adversarial attacks, particularly because their vision capabilities create new vulnerabilities. Existing defenses, such as input preprocessing, adversarial training, and response evaluation-based methods, are often impractical for real-world deployment due to their high costs. To address this challenge, we propose ASTRA, an efficient and effective defense by adaptively steering models away from adversarial feature directions to resist VLM attacks. Our key procedures involve finding transferable steering vectors representing the direction of harmful response and applying adaptive activation steering to remove these directions at inference time. To create effective steering vectors, we randomly ablate the visual tokens from the adversarial images and identify those most strongly associated with jailbreaks. These tokens are then used to construct steering vectors. During inference, we perform the adaptive steering method that involves the projection between the steering vectors and calibrated activation, resulting in little performance drops on benign inputs while strongly avoiding harmful outputs under adversarial inputs. Extensive experiments across multiple models and baselines demonstrate our state-of-the-art performance and high efficiency in mitigating jailbreak risks. Additionally, ASTRA exhibits good transferability, defending against unseen attacks (i.e., structured-based attack, perturbation-based attack with project gradient descent variants, and text-only attack). Our code is available at \url{https://github.com/ASTRAL-Group/ASTRA}.
title	Steering Away from Harm: An Adaptive Approach to Defending Vision Language Model Against Jailbreaks
topic	Computer Vision and Pattern Recognition Artificial Intelligence
url	https://arxiv.org/abs/2411.16721

Similar Items