Staff View: Patch-Fool: Are Vision Transformers Always Robust Against Adversarial Perturbations? :: Library Catalog

Bibliographic Details
Main Authors:	Fu, Yonggan; Zhang, Shunyao; Wu, Shang; Wan, Cheng; Lin, Yingyan Celine
Format:	Preprint
Published:	2022
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2203.08392

_version_	1866915089661034496
author	Fu, Yonggan; Zhang, Shunyao; Wu, Shang; Wan, Cheng; Lin, Yingyan Celine
author_facet	Fu, Yonggan; Zhang, Shunyao; Wu, Shang; Wan, Cheng; Lin, Yingyan Celine
contents	Vision transformers (ViTs) have recently set off a new wave in neural architecture design thanks to their record-breaking performance in various vision tasks. In parallel, to fulfill the goal of deploying ViTs into real-world vision applications, their robustness against potential malicious attacks has gained increasing attention. In particular, recent works show that ViTs are more robust against adversarial attacks as compared with convolutional neural networks (CNNs), and conjecture that this is because ViTs focus more on capturing global interactions among different input/feature patches, leading to their improved robustness to local perturbations imposed by adversarial attacks. In this work, we ask an intriguing question: "Under what kinds of perturbations do ViTs become more vulnerable learners compared to CNNs?" Driven by this question, we first conduct a comprehensive experiment regarding the robustness of both ViTs and CNNs under various existing adversarial attacks to understand the underlying reason favoring their robustness. Based on the drawn insights, we then propose a dedicated attack framework, dubbed Patch-Fool, that fools the self-attention mechanism by attacking its basic component (i.e., a single patch) with a series of attention-aware optimization techniques. Interestingly, our Patch-Fool framework shows for the first time that ViTs are not necessarily more robust than CNNs against adversarial perturbations. In particular, we find that ViTs are more vulnerable learners compared with CNNs against our Patch-Fool attack which is consistent across extensive experiments, and the observations from Sparse/Mild Patch-Fool, two variants of Patch-Fool, indicate an intriguing insight that the perturbation density and strength on each patch seem to be the key factors that influence the robustness ranking between ViTs and CNNs.
format	Preprint
id	arxiv_https___arxiv_org_abs_2203_08392
institution	arXiv
publishDate	2022
record_format	arxiv
title	Patch-Fool: Are Vision Transformers Always Robust Against Adversarial Perturbations?
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2203.08392

Similar Items