Staff View: Expectation Alignment: Handling Reward Misspecification in the Presence of Expectation Mismatch :: Library Catalog

Bibliographic Details
Main Authors:	Mechergui, Malek; Sreedharan, Sarath
Format:	Preprint
Published:	2024
Subjects:	Artificial Intelligence; Machine Learning
Online Access:	https://arxiv.org/abs/2404.08791

_version_	1866912096652886016
author	Mechergui, Malek; Sreedharan, Sarath
author_facet	Mechergui, Malek; Sreedharan, Sarath
contents	Detecting and handling misspecified objectives, such as reward functions, has been widely recognized as one of the central challenges within the domain of Artificial Intelligence (AI) safety research. However, even with the recognition of the importance of this problem, we are unaware of any works that attempt to provide a clear definition for what constitutes (a) misspecified objectives and (b) successfully resolving such misspecifications. In this work, we use the theory of mind, i.e., the human user's beliefs about the AI agent, as a basis to develop a formal explanatory framework called Expectation Alignment (EAL) to understand the objective misspecification and its causes. Our EAL framework not only acts as an explanatory framework for existing works but also provides us with concrete insights into the limitations of existing methods to handle reward misspecification and novel solution strategies. We use these insights to propose a new interactive algorithm that uses the specified reward to infer potential user expectations about the system behavior. We show how one can efficiently implement this algorithm by mapping the inference problem into linear programs. We evaluate our method on a set of standard Markov Decision Process (MDP) benchmarks.
format	Preprint
id	arxiv_https___arxiv_org_abs_2404_08791
institution	arXiv
publishDate	2024
record_format	arxiv
title	Expectation Alignment: Handling Reward Misspecification in the Presence of Expectation Mismatch
topic	Artificial Intelligence; Machine Learning
url	https://arxiv.org/abs/2404.08791

Similar Items