Staff View: Strong Preferences Affect the Robustness of Preference Models and Value Alignment :: Library Catalog

Bibliographic Details
Main Authors:	Xu, Ziwei; Kankanhalli, Mohan
Format:	Preprint
Published:	2024
Subjects:	Artificial Intelligence
Online Access:	https://arxiv.org/abs/2410.02451

_version_	1866912265266003968
author	Xu, Ziwei; Kankanhalli, Mohan
author_facet	Xu, Ziwei; Kankanhalli, Mohan
contents	Value alignment, which aims to ensure that large language models (LLMs) and other AI agents behave in accordance with human values, is critical for ensuring the safety and trustworthiness of these systems. A key component of value alignment is the modeling of human preferences as a representation of human values. In this paper, we investigate the robustness of value alignment by examining the sensitivity of preference models. Specifically, we ask: how do changes in the probabilities of some preferences affect the predictions of these models for other preferences? To answer this question, we theoretically analyze the robustness of widely used preference models by examining their sensitivities to minor changes in the preferences they model. Our findings reveal that, in the Bradley-Terry and Plackett-Luce models, the probability of a preference can change significantly as other preferences change, especially when these preferences are dominant (i.e., with probabilities near 0 or 1). We identify specific conditions where this sensitivity becomes significant for these models and discuss the practical implications for the robustness and safety of value alignment in AI systems.
format	Preprint
id	arxiv_https___arxiv_org_abs_2410_02451
institution	arXiv
publishDate	2024
record_format	arxiv
title	Strong Preferences Affect the Robustness of Preference Models and Value Alignment
topic	Artificial Intelligence
url	https://arxiv.org/abs/2410.02451

Similar Items