Bibliographic Details
Main authors: Aqrawi, Alan; Abbasi, Arian
Format: Preprint
Published: 2024
Subjects: Cryptography and Security; Computation and Language
Online access: https://arxiv.org/abs/2409.03131
_version_ 1866916388532125696
author Aqrawi, Alan
Abbasi, Arian
author_facet Aqrawi, Alan
Abbasi, Arian
contents This paper introduces a new method for adversarial attacks on large language models (LLMs) called the Single-Turn Crescendo Attack (STCA). Building on the multi-turn crescendo attack method introduced by Russinovich, Salem, and Eldan (2024), which gradually escalates the context to provoke harmful responses, the STCA achieves similar outcomes in a single interaction. By condensing the escalation into a single, well-crafted prompt, the STCA bypasses typical moderation filters that LLMs use to prevent inappropriate outputs. This technique reveals vulnerabilities in current LLMs and emphasizes the importance of stronger safeguards in responsible AI (RAI). The STCA offers a novel method that has not been previously explored.
format Preprint
id arxiv_https___arxiv_org_abs_2409_03131
institution arXiv
publishDate 2024
record_format arxiv
title Well, that escalated quickly: The Single-Turn Crescendo Attack (STCA)
topic Cryptography and Security
Computation and Language
url https://arxiv.org/abs/2409.03131