Staff View: Evading Toxicity Detection with ASCII-art: A Benchmark of Spatial Attacks on Moderation Systems :: Library Catalog

Bibliographic Details
Main Authors:	Berezin, Sergey; Farahbakhsh, Reza; Crespi, Noel
Format:	Preprint
Published:	2024
Subjects:	Computation and Language; Artificial Intelligence; Cryptography and Security
Online Access:	https://arxiv.org/abs/2409.18708

_version_	1866908556058427392
author	Berezin, Sergey; Farahbakhsh, Reza; Crespi, Noel
author_facet	Berezin, Sergey; Farahbakhsh, Reza; Crespi, Noel
contents	We introduce a novel class of adversarial attacks on toxicity detection models that exploit language models' failure to interpret spatially structured text in the form of ASCII art. To evaluate the effectiveness of these attacks, we propose ToxASCII, a benchmark designed to assess the robustness of toxicity detection systems against visually obfuscated inputs. Our attacks achieve a perfect Attack Success Rate (ASR) across a diverse set of state-of-the-art large language models and dedicated moderation tools, revealing a significant vulnerability in current text-only moderation systems.
format	Preprint
id	arxiv_https___arxiv_org_abs_2409_18708
institution	arXiv
publishDate	2024
record_format	arxiv
title	Evading Toxicity Detection with ASCII-art: A Benchmark of Spatial Attacks on Moderation Systems
topic	Computation and Language; Artificial Intelligence; Cryptography and Security
url	https://arxiv.org/abs/2409.18708

Similar Items