Staff View: Off-OAB: Off-Policy Policy Gradient Method with Optimal Action-Dependent Baseline :: Library Catalog

Bibliographic Details
Main Authors:	Meng, Wenjia; Zheng, Qian; Yang, Long; Yin, Yilong; Pan, Gang
Format:	Preprint
Published:	2024
Subjects:	Machine Learning; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2405.02572

_version_	1866913341294772224
author	Meng, Wenjia; Zheng, Qian; Yang, Long; Yin, Yilong; Pan, Gang
author_facet	Meng, Wenjia; Zheng, Qian; Yang, Long; Yin, Yilong; Pan, Gang
contents	Policy-based methods have achieved remarkable success in solving challenging reinforcement learning problems. Among these methods, off-policy policy gradient methods are particularly important because they can benefit from off-policy data. However, these methods suffer from the high variance of the off-policy policy gradient (OPPG) estimator, which results in poor sample efficiency during training. In this paper, we propose an off-policy policy gradient method with the optimal action-dependent baseline (Off-OAB) to mitigate this variance issue. Specifically, this baseline maintains the OPPG estimator's unbiasedness while theoretically minimizing its variance. To enhance practical computational efficiency, we design an approximated version of this optimal baseline. Utilizing this approximation, our method (Off-OAB) aims to decrease the OPPG estimator's variance during policy optimization. We evaluate the proposed Off-OAB method on six representative tasks from OpenAI Gym and MuJoCo, where it demonstrably surpasses state-of-the-art methods on the majority of these tasks.
format	Preprint
id	arxiv_https___arxiv_org_abs_2405_02572
institution	arXiv
publishDate	2024
record_format	arxiv
title	Off-OAB: Off-Policy Policy Gradient Method with Optimal Action-Dependent Baseline
topic	Machine Learning; Artificial Intelligence
url	https://arxiv.org/abs/2405.02572

Similar Items