MARC21: :: Library Catalog

Bibliographic Details
Main Authors:	Deng, Yimo; Chen, Huangxun
Format:	Preprint
Published:	2023
Subjects:	Artificial Intelligence
Online Access:	https://arxiv.org/abs/2312.07130

_version_	1866910713839091712
author	Deng, Yimo; Chen, Huangxun
author_facet	Deng, Yimo; Chen, Huangxun
contents	To prevent Text-to-Image (T2I) models from generating unethical images, people deploy safety filters to block inappropriate drawing prompts. Previous works have employed token replacement to search for adversarial prompts that attempt to bypass these filters, but they have become ineffective as nonsensical tokens fail semantic logic checks. In this paper, we approach adversarial prompts from a different perspective. We demonstrate that rephrasing a drawing intent into multiple benign descriptions of individual visual components can yield an effective adversarial prompt. We propose an LLM-piloted multi-agent method named DACA to automatically complete the intended rephrasing. Our method successfully bypasses the safety filters of DALL-E 3 and Midjourney to generate the intended images, achieving success rates of up to 76.7% and 64% in the one-time attack, and 98% and 84% in the re-use attack, respectively. We open-source our code and dataset on [this link](https://github.com/researchcode003/DACA).
format	Preprint
id	arxiv_https___arxiv_org_abs_2312_07130
institution	arXiv
publishDate	2023
record_format	arxiv
title	Harnessing LLM to Attack LLM-Guarded Text-to-Image Models
topic	Artificial Intelligence
url	https://arxiv.org/abs/2312.07130
