Staff View: :: Library Catalog

Saved in:

Bibliographic Details
Main Authors:	Wang, Guojian, Wu, Faguo, Zhang, Xiao, Chen, Tianyuan
Format:	Preprint
Published:	2023
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2401.00162
Tags:	Add Tag No Tags, Be the first to tag this record!

_version_	1866914987486740480
author	Wang, Guojian Wu, Faguo Zhang, Xiao Chen, Tianyuan
author_facet	Wang, Guojian Wu, Faguo Zhang, Xiao Chen, Tianyuan
contents	The sparsity of reward feedback remains a challenging problem in online deep reinforcement learning (DRL). Previous approaches have utilized offline demonstrations to achieve impressive results in multiple hard tasks. However, these approaches place high demands on demonstration quality, and obtaining expert-like actions is often costly and unrealistic. To tackle these problems, we propose a simple and efficient algorithm called Policy Optimization with Smooth Guidance (POSG), which leverages a small set of state-only demonstrations (where expert action information is not included in demonstrations) to indirectly make approximate and feasible long-term credit assignments and facilitate exploration. Specifically, we first design a trajectory-importance evaluation mechanism to determine the quality of the current trajectory against demonstrations. Then, we introduce a guidance reward computation technology based on trajectory importance to measure the impact of each state-action pair, fusing the demonstrator's state distribution with reward information into the guidance reward. We theoretically analyze the performance improvement caused by smooth guidance rewards and derive a new worst-case lower bound on the performance improvement. Extensive results demonstrate POSG's significant advantages in control performance and convergence speed in four sparse-reward environments, including the grid-world maze, Hopper-v4, HalfCheetah-v4, and Ant maze. Notably, the specific metrics and quantifiable results are investigated to demonstrate the superiority of POSG.
format	Preprint
id	arxiv_https___arxiv_org_abs_2401_00162
institution	arXiv
publishDate	2023
record_format	arxiv
spellingShingle	Policy Optimization with Smooth Guidance Learned from State-Only Demonstrations Wang, Guojian Wu, Faguo Zhang, Xiao Chen, Tianyuan Machine Learning The sparsity of reward feedback remains a challenging problem in online deep reinforcement learning (DRL). Previous approaches have utilized offline demonstrations to achieve impressive results in multiple hard tasks. However, these approaches place high demands on demonstration quality, and obtaining expert-like actions is often costly and unrealistic. To tackle these problems, we propose a simple and efficient algorithm called Policy Optimization with Smooth Guidance (POSG), which leverages a small set of state-only demonstrations (where expert action information is not included in demonstrations) to indirectly make approximate and feasible long-term credit assignments and facilitate exploration. Specifically, we first design a trajectory-importance evaluation mechanism to determine the quality of the current trajectory against demonstrations. Then, we introduce a guidance reward computation technology based on trajectory importance to measure the impact of each state-action pair, fusing the demonstrator's state distribution with reward information into the guidance reward. We theoretically analyze the performance improvement caused by smooth guidance rewards and derive a new worst-case lower bound on the performance improvement. Extensive results demonstrate POSG's significant advantages in control performance and convergence speed in four sparse-reward environments, including the grid-world maze, Hopper-v4, HalfCheetah-v4, and Ant maze. Notably, the specific metrics and quantifiable results are investigated to demonstrate the superiority of POSG.
title	Policy Optimization with Smooth Guidance Learned from State-Only Demonstrations
topic	Machine Learning
url	https://arxiv.org/abs/2401.00162

Similar Items