Staff View: Improving Offline RL by Blending Heuristics :: Library Catalog

Bibliographic Details
Main Authors:	Geng, Sinong; Pacchiano, Aldo; Kolobov, Andrey; Cheng, Ching-An
Format:	Preprint
Published:	2023
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2306.00321

_version_	1866917614799814656
author	Geng, Sinong; Pacchiano, Aldo; Kolobov, Andrey; Cheng, Ching-An
author_facet	Geng, Sinong; Pacchiano, Aldo; Kolobov, Andrey; Cheng, Ching-An
contents	We propose Heuristic Blending (HUBL), a simple performance-improving technique for a broad class of offline RL algorithms based on value bootstrapping. HUBL modifies the Bellman operators used in these algorithms, partially replacing the bootstrapped values with heuristic ones that are estimated with Monte-Carlo returns. For trajectories with higher returns, HUBL relies more on the heuristic values and less on bootstrapping; otherwise, it leans more heavily on bootstrapping. HUBL is very easy to combine with many existing offline RL implementations by relabeling the offline datasets with adjusted rewards and discount factors. We derive a theory that explains HUBL's effect on offline RL as reducing offline RL's complexity and thus increasing its finite-sample performance. Furthermore, we empirically demonstrate that HUBL consistently improves the policy quality of four state-of-the-art bootstrapping-based offline RL algorithms (ATAC, CQL, TD3+BC, and IQL), by 9% on average over 27 datasets of the D4RL and Meta-World benchmarks.
format	Preprint
id	arxiv_https___arxiv_org_abs_2306_00321
institution	arXiv
publishDate	2023
record_format	arxiv
title	Improving Offline RL by Blending Heuristics
topic	Machine Learning
url	https://arxiv.org/abs/2306.00321
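
The contents field above says HUBL can be applied by relabeling an offline dataset with adjusted rewards and discount factors. The record gives no formulas, so the following is only a minimal sketch under stated assumptions: the blended reward is taken to be r + gamma * lam * h(s'), the adjusted discount gamma * (1 - lam), and the heuristic h(s') the Monte-Carlo return-to-go from the next step (set to 0 after the last step of a trajectory). The function names and the min-max rule that gives higher-return trajectories a larger blending weight are hypothetical and chosen for illustration; they are not taken from the paper.

```python
import numpy as np


def monte_carlo_returns(rewards, gamma):
    """Discounted return-to-go at every step of one trajectory."""
    returns = np.zeros(len(rewards), dtype=float)
    running = 0.0
    # Accumulate backwards so each entry is r_t + gamma * G_{t+1}.
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def hubl_relabel(rewards, gamma, lam):
    """Relabel one trajectory (assumed blending form, see lead-in).

    new_reward_t   = r_t + gamma * lam * h(s_{t+1})
    new_discount_t = gamma * (1 - lam)
    where h(s_{t+1}) is the Monte-Carlo return from step t+1.
    """
    rewards = np.asarray(rewards, dtype=float)
    h = monte_carlo_returns(rewards, gamma)
    # Heuristic of the next state; assumed 0 past the end of the trajectory.
    h_next = np.append(h[1:], 0.0)
    new_rewards = rewards + gamma * lam * h_next
    new_discounts = np.full(len(rewards), gamma * (1.0 - lam))
    return new_rewards, new_discounts


def trajectory_lambdas(traj_returns, lam_max=0.5):
    """Hypothetical rule: scale lam linearly with the trajectory return."""
    g = np.asarray(traj_returns, dtype=float)
    span = g.max() - g.min()
    if span == 0.0:
        # All trajectories are equally good; fall back to pure bootstrapping.
        return np.zeros_like(g)
    return lam_max * (g - g.min()) / span
```

Under these assumptions, lam = 0 leaves the data unchanged (pure bootstrapping), while lam = 1 sets the discount to 0 and turns each reward into the Monte-Carlo return from that step (pure heuristic). The relabeled rewards and per-step discounts could then be passed to an existing bootstrapping-based learner in place of the originals.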

Similar Items