Bibliographic Details
Main Authors: Zhang, Guoxi; Bao, Han; Kashima, Hisashi
Format: Preprint
Published: 2024
Online Access: https://arxiv.org/abs/2403.10160
Table of Contents:
  • In preference-based reinforcement learning (PbRL), a reward function is learned from a type of human feedback called preferences. To expedite preference collection, recent works have leveraged offline preferences, which are preferences collected for some offline data. In this scenario, the learned reward function is fitted on the offline data, so if a learning agent exhibits behaviors that do not overlap with the offline data, the learned reward function may generalize poorly. To address this problem, the present study introduces a framework for PbRL that consolidates offline preferences with virtual preferences, which are comparisons between the agent's behaviors and the offline data. Critically, the virtual preferences let the reward function track the agent's behaviors, thereby offering well-aligned guidance to the agent. Through experiments on continuous control tasks, this study demonstrates the effectiveness of incorporating virtual preferences in PbRL.
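    To make the idea of fitting a reward function on preference pairs concrete, the following is a minimal sketch under stated assumptions, not the authors' method. It assumes a Bradley-Terry style likelihood, which is a common choice in PbRL but is not named in the abstract. The names `pair_nll`, `combined_loss`, `virtual_weight`, and `toy_reward` are hypothetical, and the abstract does not say how virtual preferences are labeled or weighted, so both are left to the caller here.

```python
import math

def segment_return(reward_fn, segment):
    """Sum of predicted rewards over a segment of (state, action) pairs."""
    return sum(reward_fn(s, a) for s, a in segment)

def pair_nll(reward_fn, preferred, other):
    """Negative log-likelihood of one preference under a Bradley-Terry model
    (an assumption; the abstract does not specify the likelihood)."""
    diff = segment_return(reward_fn, preferred) - segment_return(reward_fn, other)
    # -log(sigmoid(diff)) equals softplus(-diff); this form avoids overflow.
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))

def preference_loss(reward_fn, pairs):
    """Mean loss over a non-empty list of (preferred, other) segment pairs."""
    return sum(pair_nll(reward_fn, p, o) for p, o in pairs) / len(pairs)

def combined_loss(reward_fn, offline_pairs, virtual_pairs, virtual_weight=1.0):
    """Offline-preference loss plus a weighted virtual-preference loss.
    The additive form and virtual_weight are illustrative assumptions."""
    return (preference_loss(reward_fn, offline_pairs)
            + virtual_weight * preference_loss(reward_fn, virtual_pairs))

# Toy data: scalar states and actions, and a hand-written reward function.
toy_reward = lambda state, action: 0.5 * state + 0.1 * action
seg_hi = [(1.0, 0.0), (2.0, 1.0)]
seg_lo = [(0.0, 0.0), (0.5, 0.0)]

offline_pairs = [(seg_hi, seg_lo)]   # labeled by a human annotator
virtual_pairs = [(seg_hi, seg_lo)]   # offline segment vs. agent segment;
                                     # which side is "preferred" is assumed here
loss = combined_loss(toy_reward, offline_pairs, virtual_pairs, virtual_weight=0.5)
```

    In a real setup `reward_fn` would be a trainable model and the loss would be minimized by gradient descent; the sketch only shows how the two sources of comparisons could enter one objective.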