Staff View: AI Alignment Strategies from a Risk Perspective: Independent Safety Mechanisms or Shared Failures? :: Library Catalog

Bibliographic Details
Main Authors:	Dung, Leonard; Mai, Florian
Format:	Preprint
Published:	2025
Subjects:	Artificial Intelligence
Online Access:	https://arxiv.org/abs/2510.11235

_version_	1866909841462657024
author	Dung, Leonard; Mai, Florian
author_facet	Dung, Leonard; Mai, Florian
contents	AI alignment research aims to develop techniques to ensure that AI systems do not cause harm. However, every alignment technique has failure modes, which are conditions in which there is a non-negligible chance that the technique fails to provide safety. As a strategy for risk mitigation, the AI safety community has increasingly adopted a defense-in-depth framework: Conceding that there is no single technique which guarantees safety, defense-in-depth consists in having multiple redundant protections against safety failure, such that safety can be maintained even if some protections fail. However, the success of defense-in-depth depends on how (un)correlated failure modes are across alignment techniques. For example, if all techniques had the exact same failure modes, the defense-in-depth approach would provide no additional protection at all. In this paper, we analyze 7 representative alignment techniques and 7 failure modes to understand the extent to which they overlap. We then discuss our results' implications for understanding the current level of risk and how to prioritize AI alignment research in the future.
format	Preprint
id	arxiv_https___arxiv_org_abs_2510_11235
institution	arXiv
publishDate	2025
record_format	arxiv
spellingShingle	AI Alignment Strategies from a Risk Perspective: Independent Safety Mechanisms or Shared Failures? Dung, Leonard Mai, Florian Artificial Intelligence AI alignment research aims to develop techniques to ensure that AI systems do not cause harm. However, every alignment technique has failure modes, which are conditions in which there is a non-negligible chance that the technique fails to provide safety. As a strategy for risk mitigation, the AI safety community has increasingly adopted a defense-in-depth framework: Conceding that there is no single technique which guarantees safety, defense-in-depth consists in having multiple redundant protections against safety failure, such that safety can be maintained even if some protections fail. However, the success of defense-in-depth depends on how (un)correlated failure modes are across alignment techniques. For example, if all techniques had the exact same failure modes, the defense-in-depth approach would provide no additional protection at all. In this paper, we analyze 7 representative alignment techniques and 7 failure modes to understand the extent to which they overlap. We then discuss our results' implications for understanding the current level of risk and how to prioritize AI alignment research in the future.
title	AI Alignment Strategies from a Risk Perspective: Independent Safety Mechanisms or Shared Failures?
topic	Artificial Intelligence
url	https://arxiv.org/abs/2510.11235

Similar Items