Bibliographic Details
Main Authors: Ashcraft, Chace; Karra, Kiran; Carney, Josh; Drenkow, Nathan
Format: Preprint
Published: 2025
Subjects: Machine Learning; Artificial Intelligence
Online Access: https://arxiv.org/abs/2504.08943
_version_ 1866917982887739392
author Ashcraft, Chace
Karra, Kiran
Carney, Josh
Drenkow, Nathan
contents The Treacherous Turn refers to the scenario where an artificial intelligence (AI) agent subtly, and perhaps covertly, learns to perform a behavior that benefits itself but is deemed undesirable and potentially harmful to a human supervisor. During training, the agent learns to behave as expected by the human supervisor, but when deployed to perform its task, it performs an alternate behavior without the supervisor there to prevent it. Initial experiments applying deep reinforcement learning (DRL) to an implementation of the A Link to the Past example do not produce the treacherous turn effect naturally, despite various modifications to the environment intended to produce it. However, in this work, we find the treacherous behavior to be reproducible in a DRL agent when using other trojan injection strategies. This approach deviates from the prototypical treacherous turn behavior since the behavior is explicitly trained into the agent, rather than occurring as an emergent consequence of environmental complexity or poor objective specification. Nonetheless, these experiments provide new insights into the challenges of producing agents capable of true treacherous turn behavior.
format Preprint
id arxiv_https___arxiv_org_abs_2504_08943
institution arXiv
publishDate 2025
record_format arxiv
title Investigating the Treacherous Turn in Deep Reinforcement Learning
topic Machine Learning
Artificial Intelligence
url https://arxiv.org/abs/2504.08943