Staff View: Implementing surrogate goals for safer bargaining in LLM-based agents :: Library Catalog


Bibliographic Details
Main Authors:	Oesterheld, Caspar; Riché, Maxime; Sondej, Filip; Clifton, Jesse; Conitzer, Vincent
Format:	Preprint
Published:	2026
Subjects:	Artificial Intelligence
Online Access:	https://arxiv.org/abs/2604.04341

_version_	1866910105528696832
author	Oesterheld, Caspar; Riché, Maxime; Sondej, Filip; Clifton, Jesse; Conitzer, Vincent
author_facet	Oesterheld, Caspar; Riché, Maxime; Sondej, Filip; Clifton, Jesse; Conitzer, Vincent
contents	Surrogate goals have been proposed as a strategy for reducing risks from bargaining failures. A surrogate goal is a goal that a principal can give an AI agent and that deflects any threats against the agent away from what the principal cares about. For example, one might make one's agent care about preventing money from being burned. Then in bargaining interactions, other agents can threaten to burn their money instead of threatening to spend money to hurt the principal. Importantly, the agent has to care as much about preventing money from being burned as it cares about money being spent to hurt the principal. In this paper, we implement surrogate goals in language-model-based agents. In particular, we try to get a language-model-based agent to react to threats of burning money in the same way it would react to "normal" threats. We propose four different methods, using techniques of prompting, fine-tuning, and scaffolding. We evaluate the four methods experimentally. We find that methods based on scaffolding and fine-tuning outperform simple prompting. In particular, fine-tuning and scaffolding more precisely implement the desired behavior with respect to threats against the surrogate goal. We also compare the different methods in terms of their side effects on capabilities and propensities in other situations. We find that scaffolding-based methods perform best.
format	Preprint
id	arxiv_https___arxiv_org_abs_2604_04341
institution	arXiv
publishDate	2026
record_format	arxiv
title	Implementing surrogate goals for safer bargaining in LLM-based agents
topic	Artificial Intelligence
url	https://arxiv.org/abs/2604.04341

Similar Items