Bibliographic Details
Main Authors: Niess, Georg; Kern, Roman
Format: Preprint
Published: 2024
Subjects: Computation and Language
Online Access: https://arxiv.org/abs/2405.08400
_version_ 1866913350336643072
author Niess, Georg
Kern, Roman
contents The rapid advancement of large language models (LLMs) has made it increasingly difficult to distinguish between text written by humans and machines. Addressing this, we propose a novel method for generating watermarks that strategically alters token probabilities during generation. Unlike previous work, this method employs linguistic features such as stylometry. Concretely, we introduce acrostica and sensorimotor norms to LLMs. Further, these features are parameterized by a key, which is updated every sentence. To compute this key, we use semantic zero-shot classification, which enhances resilience. In our evaluation, we find that for three or more sentences, our method achieves a false positive and false negative rate of 0.02. For the case of a cyclic translation attack, we observe similar results for seven or more sentences. This research is of particular interest for proprietary LLMs to facilitate accountability and prevent societal harm.
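The abstract describes altering token probabilities so that generated text carries a key-dependent stylistic signal, such as an acrostic. The following is a minimal sketch of that general idea only, not the authors' implementation. The integer key, the `letter_for_key` mapping, the `bias_logits` helper, the bias strength `delta`, and the toy vocabulary are all hypothetical and chosen for illustration. The record does not specify how the key is derived beyond "semantic zero-shot classification", so the key is simply assumed to be given.

```python
import math
import string


def letter_for_key(key: int) -> str:
    """Map an integer key to a target initial letter (hypothetical mapping)."""
    return string.ascii_lowercase[key % 26]


def bias_logits(logits: dict[str, float], letter: str, delta: float = 2.0) -> dict[str, float]:
    """Add `delta` to the logit of every token that starts with `letter`.

    This nudges, but does not force, the sentence-initial word toward the
    target letter. `delta` is an assumed tuning parameter.
    """
    return {
        tok: (score + delta if tok.lower().startswith(letter) else score)
        for tok, score in logits.items()
    }


def softmax(logits: dict[str, float]) -> dict[str, float]:
    """Convert logits to probabilities, subtracting the max for stability."""
    m = max(logits.values())
    exps = {tok: math.exp(score - m) for tok, score in logits.items()}
    z = sum(exps.values())
    return {tok: e / z for tok, e in exps.items()}


# Toy next-token logits for the first word of a new sentence (made-up values).
logits = {"The": 1.0, "Some": 1.0, "A": 1.0}

# Assume the key for this sentence is 18; a real system would recompute it
# for every sentence.
letter = letter_for_key(18)
biased = bias_logits(logits, letter)
probs = softmax(biased)
```

In this toy setting the three candidate words start out equally likely, and after biasing, the word beginning with the target letter becomes the most probable one. A detector that knows the key sequence could then check whether sentence-initial letters match the expected ones more often than chance would suggest.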
format Preprint
id arxiv_https___arxiv_org_abs_2405_08400
institution arXiv
publishDate 2024
record_format arxiv
title Stylometric Watermarks for Large Language Models
topic Computation and Language
url https://arxiv.org/abs/2405.08400