Staff View: Investigating the Treacherous Turn in Deep Reinforcement Learning :: Library Catalog

Bibliographic Details
Main Authors:	Ashcraft, Chace; Karra, Kiran; Carney, Josh; Drenkow, Nathan
Format:	Preprint
Published:	2025
Subjects:	Machine Learning; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2504.08943

_version_	1866917982887739392
author	Ashcraft, Chace; Karra, Kiran; Carney, Josh; Drenkow, Nathan
author_facet	Ashcraft, Chace; Karra, Kiran; Carney, Josh; Drenkow, Nathan
contents	The Treacherous Turn refers to the scenario where an artificial intelligence (AI) agent subtly, and perhaps covertly, learns to perform a behavior that benefits itself but is deemed undesirable and potentially harmful to a human supervisor. During training, the agent learns to behave as expected by the human supervisor, but when deployed to perform its task, it performs an alternate behavior without the supervisor there to prevent it. Initial experiments applying deep reinforcement learning (DRL) to an implementation of the "A Link to the Past" example do not produce the treacherous turn effect naturally, despite various modifications to the environment intended to produce it. However, in this work, we find the treacherous behavior to be reproducible in a DRL agent when using other trojan injection strategies. This approach deviates from the prototypical treacherous turn behavior since the behavior is explicitly trained into the agent, rather than occurring as an emergent consequence of environmental complexity or poor objective specification. Nonetheless, these experiments provide new insights into the challenges of producing agents capable of true treacherous turn behavior.
format	Preprint
id	arxiv_https___arxiv_org_abs_2504_08943
institution	arXiv
publishDate	2025
record_format	arxiv
title	Investigating the Treacherous Turn in Deep Reinforcement Learning
topic	Machine Learning; Artificial Intelligence
url	https://arxiv.org/abs/2504.08943

Similar Items