| Main Authors: | Wang, Guojian; Liu, Jianxiang; Li, Xinyuan; Wu, Faguo; Zhang, Xiao; Chen, Tianyuan; Chen, Xuyang |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | Machine Learning |
| Online Access: | https://arxiv.org/abs/2407.06503 |
| _version_ | 1866912696566284288 |
|---|---|
| author | Wang, Guojian; Liu, Jianxiang; Li, Xinyuan; Wu, Faguo; Zhang, Xiao; Chen, Tianyuan; Chen, Xuyang |
| author_facet | Wang, Guojian; Liu, Jianxiang; Li, Xinyuan; Wu, Faguo; Zhang, Xiao; Chen, Tianyuan; Chen, Xuyang |
| contents | In this paper, we investigate preference-based reinforcement learning (PbRL), which enables reinforcement learning (RL) agents to learn from human feedback. This is particularly valuable when defining a fine-grained reward function is not feasible. However, this approach is inefficient and impractical for promoting deep exploration in hard-exploration tasks with long horizons and sparse rewards. To tackle this issue, we introduce LOPE: Learning Online with trajectory Preference guidancE, an end-to-end preference-guided RL framework that enhances exploration efficiency in hard-exploration tasks. Our intuition is that LOPE directly adjusts the focus of online exploration by treating human feedback as guidance, thereby avoiding the need to learn a separate reward model from preferences. Specifically, LOPE includes a two-step sequential policy optimization technique consisting of a trust-region-based policy improvement step and a preference guidance step. We reformulate preference guidance as a trajectory-wise state marginal matching problem that minimizes the maximum mean discrepancy distance between the preferred trajectories and the learned policy. Furthermore, we provide a theoretical analysis to characterize the performance improvement bound and evaluate the effectiveness of LOPE. When assessed in various challenging hard-exploration environments, LOPE outperforms several state-of-the-art methods in terms of convergence rate and overall performance. The code used in this study is available at https://github.com/buaawgj/LOPE. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2407_06503 |
| institution | arXiv |
| publishDate | 2024 |
| record_format | arxiv |
| title | Preference-Guided Reinforcement Learning for Efficient Exploration |
| topic | Machine Learning |
| url | https://arxiv.org/abs/2407.06503 |
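
The abstract describes preference guidance as minimizing a maximum mean discrepancy (MMD) distance between states from preferred trajectories and states visited by the learned policy. As a rough illustration of that quantity only, the sketch below computes a biased sample estimate of squared MMD between two sets of state vectors. It is not taken from the LOPE code: the Gaussian kernel, the `bandwidth` parameter, and the function names `gaussian_kernel`, `mean_kernel`, and `mmd_squared` are all assumptions made for this example.

```python
import math


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two state vectors (assumed kernel choice)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mean_kernel(xs, ys, bandwidth=1.0):
    """Average kernel value over all pairs (x, y) with x in xs and y in ys."""
    total = sum(gaussian_kernel(x, y, bandwidth) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd_squared(preferred_states, policy_states, bandwidth=1.0):
    """Biased (V-statistic) estimate of squared MMD between two state samples.

    preferred_states: states drawn from human-preferred trajectories.
    policy_states: states visited by the current policy.
    """
    k_pp = mean_kernel(preferred_states, preferred_states, bandwidth)
    k_qq = mean_kernel(policy_states, policy_states, bandwidth)
    k_pq = mean_kernel(preferred_states, policy_states, bandwidth)
    return k_pp + k_qq - 2.0 * k_pq


if __name__ == "__main__":
    # Hypothetical 2-D states, used only to exercise the functions.
    preferred = [[0.0, 0.0], [1.0, 0.0]]
    visited = [[5.0, 5.0], [6.0, 5.0]]
    print(mmd_squared(preferred, visited))
```

The estimate is zero when the two samples coincide and grows as the policy's visited states move away from the preferred ones, which is the property a guidance term of this kind would rely on. How LOPE turns the distance into a policy update is described in the paper and its repository, not here.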