| Main authors: | Clymer, Joshua; Weinbaum, Jonah; Kirk, Robert; Mai, Kimberly; Zhang, Selena; Davies, Xander |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | Machine Learning; Artificial Intelligence |
| Online access: | https://arxiv.org/abs/2505.18003 |
| _version_ | 1866916754933940224 |
|---|---|
| author | Clymer, Joshua; Weinbaum, Jonah; Kirk, Robert; Mai, Kimberly; Zhang, Selena; Davies, Xander |
| author_facet | Clymer, Joshua; Weinbaum, Jonah; Kirk, Robert; Mai, Kimberly; Zhang, Selena; Davies, Xander |
| contents | Existing evaluations of AI misuse safeguards provide a patchwork of evidence that is often difficult to connect to real-world decisions. To bridge this gap, we describe an end-to-end argument (a "safety case") that misuse safeguards reduce the risk posed by an AI assistant to low levels. We first describe how a hypothetical developer red teams safeguards, estimating the effort required to evade them. Then, the developer plugs this estimate into a quantitative "uplift model" to determine how much barriers introduced by safeguards dissuade misuse (https://www.aimisusemodel.com/). This procedure provides a continuous signal of risk during deployment that helps the developer rapidly respond to emerging threats. Finally, we describe how to tie these components together into a simple safety case. Our work provides one concrete path -- though not the only path -- to rigorously justifying AI misuse risks are low. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2505_18003 |
| institution | arXiv |
| publishDate | 2025 |
| record_format | arxiv |
| title | An Example Safety Case for Safeguards Against Misuse |
| topic | Machine Learning; Artificial Intelligence |
| url | https://arxiv.org/abs/2505.18003 |