Saved in:
Bibliographic Details
Main Authors: Flageat, Manon, Lim, Bryan, Cully, Antoine
Format: Preprint
Published: 2023
Subjects:
Online Access:https://arxiv.org/abs/2312.07178
Tags: Add Tag
No Tags, Be the first to tag this record!
_version_ 1866909078133932032
author Flageat, Manon
Lim, Bryan
Cully, Antoine
author_facet Flageat, Manon
Lim, Bryan
Cully, Antoine
contents Many applications in Reinforcement Learning (RL) usually have noise or stochasticity present in the environment. Beyond their impact on learning, these uncertainties lead the exact same policy to perform differently, i.e. yield different return, from one roll-out to another. Common evaluation procedures in RL summarise the consequent return distributions using solely the expected return, which does not account for the spread of the distribution. Our work defines this spread as the policy reproducibility: the ability of a policy to obtain similar performance when rolled out many times, a crucial property in some real-world applications. We highlight that existing procedures that only use the expected return are limited on two fronts: first an infinite number of return distributions with a wide range of performance-reproducibility trade-offs can have the same expected return, limiting its effectiveness when used for comparing policies; second, the expected return metric does not leave any room for practitioners to choose the best trade-off value for considered applications. In this work, we address these limitations by recommending the use of Lower Confidence Bound, a metric taken from Bayesian optimisation that provides the user with a preference parameter to choose a desired performance-reproducibility trade-off. We also formalise and quantify policy reproducibility, and demonstrate the benefit of our metrics using extensive experiments of popular RL algorithms on common uncertain RL tasks.
format Preprint
id arxiv_https___arxiv_org_abs_2312_07178
institution arXiv
publishDate 2023
record_format arxiv
spellingShingle Beyond Expected Return: Accounting for Policy Reproducibility when Evaluating Reinforcement Learning Algorithms
Flageat, Manon
Lim, Bryan
Cully, Antoine
Machine Learning
Artificial Intelligence
Many applications in Reinforcement Learning (RL) usually have noise or stochasticity present in the environment. Beyond their impact on learning, these uncertainties lead the exact same policy to perform differently, i.e. yield different return, from one roll-out to another. Common evaluation procedures in RL summarise the consequent return distributions using solely the expected return, which does not account for the spread of the distribution. Our work defines this spread as the policy reproducibility: the ability of a policy to obtain similar performance when rolled out many times, a crucial property in some real-world applications. We highlight that existing procedures that only use the expected return are limited on two fronts: first an infinite number of return distributions with a wide range of performance-reproducibility trade-offs can have the same expected return, limiting its effectiveness when used for comparing policies; second, the expected return metric does not leave any room for practitioners to choose the best trade-off value for considered applications. In this work, we address these limitations by recommending the use of Lower Confidence Bound, a metric taken from Bayesian optimisation that provides the user with a preference parameter to choose a desired performance-reproducibility trade-off. We also formalise and quantify policy reproducibility, and demonstrate the benefit of our metrics using extensive experiments of popular RL algorithms on common uncertain RL tasks.
title Beyond Expected Return: Accounting for Policy Reproducibility when Evaluating Reinforcement Learning Algorithms
topic Machine Learning
Artificial Intelligence
url https://arxiv.org/abs/2312.07178