Staff View: Bypassing DARCY Defense: Indistinguishable Universal Adversarial Triggers :: Library Catalog


Bibliographic Details
Main Authors:	Peng, Zuquan; He, Yuanyuan; Ni, Jianbing; Niu, Ben
Format:	Preprint
Published:	2024
Subjects:	Computation and Language; Artificial Intelligence; I.2.7
Online Access:	https://arxiv.org/abs/2409.03183

_version_	1866929487344566272
author	Peng, Zuquan; He, Yuanyuan; Ni, Jianbing; Niu, Ben
author_facet	Peng, Zuquan; He, Yuanyuan; Ni, Jianbing; Niu, Ben
contents	Neural network (NN) classification models for Natural Language Processing (NLP) are vulnerable to the Universal Adversarial Triggers (UAT) attack, which causes a model to produce a specific prediction for any input. DARCY borrows the "honeypot" concept to bait multiple trapdoors, effectively detecting the adversarial examples generated by UAT. Unfortunately, we find a new UAT generation method, called IndisUAT, which produces triggers (i.e., tokens) and uses them to craft adversarial examples whose feature distribution is indistinguishable from that of the benign examples in a randomly chosen category at the detection layer of DARCY. The produced adversarial examples incur the maximal prediction loss in the DARCY-protected models. Meanwhile, the produced triggers are effective in black-box models for text generation, text inference, and reading comprehension. Finally, the evaluation results under NN models for NLP tasks indicate that the IndisUAT method can effectively circumvent DARCY and penetrate other defenses. For example, IndisUAT can reduce the true positive rate of DARCY's detection by at least 40.8% and 90.6%, and drop the accuracy by at least 33.3% and 51.6%, in the RNN and CNN models, respectively. IndisUAT reduces the accuracy of BERT's adversarial defense model by at least 34.0%, and makes the GPT-2 language model spew racist outputs even when conditioned on non-racial context.
format	Preprint
id	arxiv_https___arxiv_org_abs_2409_03183
institution	arXiv
publishDate	2024
record_format	arxiv
title	Bypassing DARCY Defense: Indistinguishable Universal Adversarial Triggers
topic	Computation and Language; Artificial Intelligence; I.2.7
url	https://arxiv.org/abs/2409.03183

Similar Items