Bibliographic Details
Main Authors: Berezin, Sergey; Farahbakhsh, Reza; Crespi, Noel
Format: Preprint
Published: 2024
Subjects: Computation and Language; Artificial Intelligence; Cryptography and Security
Online Access: https://arxiv.org/abs/2409.18708
author Berezin, Sergey
Farahbakhsh, Reza
Crespi, Noel
contents We introduce a novel class of adversarial attacks on toxicity detection models that exploit language models' failure to interpret spatially structured text in the form of ASCII art. To evaluate the effectiveness of these attacks, we propose ToxASCII, a benchmark designed to assess the robustness of toxicity detection systems against visually obfuscated inputs. Our attacks achieve a perfect Attack Success Rate (ASR) across a diverse set of state-of-the-art large language models and dedicated moderation tools, revealing a significant vulnerability in current text-only moderation systems.
format Preprint
id arxiv_https___arxiv_org_abs_2409_18708
institution arXiv
publishDate 2024
record_format arxiv
title Evading Toxicity Detection with ASCII-art: A Benchmark of Spatial Attacks on Moderation Systems
topic Computation and Language
Artificial Intelligence
Cryptography and Security
url https://arxiv.org/abs/2409.18708