Bibliographic Details
Main Authors: Liu, Pai; Zhao, Lingfeng; Agarwal, Shivangi; Liu, Jinghan; Huang, Audrey; Amortila, Philip; Jiang, Nan
Format: Preprint
Published: 2025
Subjects: Machine Learning; Artificial Intelligence
Online Access: https://arxiv.org/abs/2502.08021
_version_ 1866918168436408320
author Liu, Pai
Zhao, Lingfeng
Agarwal, Shivangi
Liu, Jinghan
Huang, Audrey
Amortila, Philip
Jiang, Nan
contents Holdout validation and hyperparameter tuning from data is a long-standing problem in offline reinforcement learning (RL). A standard framework is to use off-policy evaluation (OPE) methods to evaluate and select the policies, but OPE either incurs exponential variance (e.g., importance sampling) or has hyperparameters of its own (e.g., FQE and model-based methods). We focus on hyperparameter tuning for OPE itself, which is even more under-investigated. Concretely, we select among candidate value functions ("model-free") or dynamics models ("model-based") to best assess the performance of a target policy. We develop: (1) new model-free and model-based selectors with theoretical guarantees, and (2) a new experimental protocol for empirically evaluating them. Compared to the model-free protocol in prior works, our new protocol allows for more stable generation and better control of candidate value functions in an optimization-free manner, and supports evaluation of model-free and model-based methods alike. We exemplify the protocol on Gym-Hopper, and find that our new model-free selector, LSTD-Tournament, demonstrates promising empirical performance.
format Preprint
id arxiv_https___arxiv_org_abs_2502_08021
institution arXiv
publishDate 2025
record_format arxiv
title Model Selection for Off-policy Evaluation: New Algorithms and Experimental Protocol
topic Machine Learning
Artificial Intelligence
url https://arxiv.org/abs/2502.08021