Bibliographic Details
Main Authors: Iablochnikov, Viacheslav; Rogachev, Alexander
Format: Preprint
Published: 2024
Subjects:
Online Access: https://arxiv.org/abs/2412.01725
author Iablochnikov, Viacheslav
Rogachev, Alexander
contents Models that work with several modalities at once in a chat format are becoming increasingly popular. These models are nonetheless exposed to potential attacks, especially because many of them include open-source components. It is important to study whether the vulnerabilities of these components are inherited and how dangerous this is when such models are used in industry. This work studies various types of attacks on such models and evaluates how well they generalize. Modern VLMs (LLaVA, BLIP, etc.) often reuse pre-trained parts of other models, so the main part of this research focuses on those parts, specifically the CLIP architecture and its image encoder (CLIP-ViT), and on several variations of patch attacks against it.
format Preprint
id arxiv_https___arxiv_org_abs_2412_01725
institution arXiv
publishDate 2024
record_format arxiv
title Attacks on multimodal models
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2412.01725