Bibliographic Details
Main Authors: Clymer, Joshua; Weinbaum, Jonah; Kirk, Robert; Mai, Kimberly; Zhang, Selena; Davies, Xander
Format: Preprint
Published: 2025
Subjects:
Online Access: https://arxiv.org/abs/2505.18003
_version_ 1866916754933940224
author Clymer, Joshua
Weinbaum, Jonah
Kirk, Robert
Mai, Kimberly
Zhang, Selena
Davies, Xander
author_facet Clymer, Joshua
Weinbaum, Jonah
Kirk, Robert
Mai, Kimberly
Zhang, Selena
Davies, Xander
contents Existing evaluations of AI misuse safeguards provide a patchwork of evidence that is often difficult to connect to real-world decisions. To bridge this gap, we describe an end-to-end argument (a "safety case") that misuse safeguards reduce the risk posed by an AI assistant to low levels. We first describe how a hypothetical developer red teams safeguards, estimating the effort required to evade them. Then, the developer plugs this estimate into a quantitative "uplift model" to determine how much the barriers introduced by safeguards dissuade misuse (https://www.aimisusemodel.com/). This procedure provides a continuous signal of risk during deployment that helps the developer rapidly respond to emerging threats. Finally, we describe how to tie these components together into a simple safety case. Our work provides one concrete path, though not the only one, to rigorously justifying that AI misuse risks are low.
format Preprint
id arxiv_https___arxiv_org_abs_2505_18003
institution arXiv
publishDate 2025
record_format arxiv
title An Example Safety Case for Safeguards Against Misuse
topic Machine Learning
Artificial Intelligence
url https://arxiv.org/abs/2505.18003