Bibliographic Details
Main Authors:	de Melo, Gabriel Adriano; Maximo, Marcos Ricardo Omena De Albuquerque; Soma, Nei Yoshihiro; de Castro, Paulo Andre Lima
Format:	Preprint
Published:	2024
Subjects:	Artificial Intelligence
Online Access:	https://arxiv.org/abs/2408.08995

_version_	1866929462993485824
author	de Melo, Gabriel Adriano; Maximo, Marcos Ricardo Omena De Albuquerque; Soma, Nei Yoshihiro; de Castro, Paulo Andre Lima
author_facet	de Melo, Gabriel Adriano; Maximo, Marcos Ricardo Omena De Albuquerque; Soma, Nei Yoshihiro; de Castro, Paulo Andre Lima
contents	The inner alignment problem, which asks whether an arbitrary artificial intelligence (AI) model satisfies a non-trivial alignment function of its outputs given its inputs, is undecidable. This follows rigorously from Rice's theorem, which is in turn equivalent to a reduction to Turing's Halting Problem; a proof sketch is presented in this work. Nevertheless, there is an enumerable set of provably aligned AIs that are constructed from a finite set of provably aligned operations. Therefore, we argue that alignment should be a property guaranteed by the AI architecture rather than a characteristic imposed post hoc on an arbitrary AI model. Furthermore, while the outer alignment problem is the definition of a judge function that captures human values and preferences, we propose that such a function must also impose a halting constraint guaranteeing that the AI model always reaches a terminal state in a finite number of execution steps. Our work presents examples and models that illustrate this constraint and the intricate challenges involved, advancing a compelling case for an intrinsically hard-aligned approach to AI system architectures that ensures halting.
format	Preprint
id	arxiv_https___arxiv_org_abs_2408_08995
institution	arXiv
publishDate	2024
record_format	arxiv
title	On the Undecidability of Artificial Intelligence Alignment: Machines that Halt
topic	Artificial Intelligence
url	https://arxiv.org/abs/2408.08995

Similar Items