Bibliographic Details
Main Authors: Peng, Zuquan, He, Yuanyuan, Ni, Jianbing, Niu, Ben
Format: Preprint
Published: 2024
Subjects: Computation and Language, Artificial Intelligence, I.2.7
Online Access: https://arxiv.org/abs/2409.03183
author Peng, Zuquan
He, Yuanyuan
Ni, Jianbing
Niu, Ben
contents Neural network (NN) classification models for Natural Language Processing (NLP) are vulnerable to the Universal Adversarial Triggers (UAT) attack, which causes a model to produce a specific prediction for any input. DARCY borrows the "honeypot" concept to bait multiple trapdoors, effectively detecting the adversarial examples generated by UAT. Unfortunately, we find a new UAT generation method, called IndisUAT, which produces triggers (i.e., tokens) and uses them to craft adversarial examples whose feature distribution is indistinguishable from that of the benign examples in a randomly chosen category at the detection layer of DARCY. The produced adversarial examples incur the maximal prediction loss in the DARCY-protected models. Meanwhile, the produced triggers are effective against black-box models for text generation, text inference, and reading comprehension. Finally, evaluation results on NN models for NLP tasks indicate that the IndisUAT method can effectively circumvent DARCY and penetrate other defenses. For example, IndisUAT reduces the true positive rate of DARCY's detection by at least 40.8% and 90.6%, and drops the accuracy by at least 33.3% and 51.6%, in the RNN and CNN models, respectively. IndisUAT reduces the accuracy of BERT's adversarial defense model by at least 34.0%, and makes the GPT-2 language model spew racist outputs even when conditioned on non-racial context.
format Preprint
id arxiv_https___arxiv_org_abs_2409_03183
institution arXiv
publishDate 2024
record_format arxiv
title Bypassing DARCY Defense: Indistinguishable Universal Adversarial Triggers
topic Computation and Language
Artificial Intelligence
I.2.7
url https://arxiv.org/abs/2409.03183