MARC21 :: Library Catalog

Bibliographic Details
Main Authors:	Liu, Pai; Zhao, Lingfeng; Agarwal, Shivangi; Liu, Jinghan; Huang, Audrey; Amortila, Philip; Jiang, Nan
Format:	Preprint
Published:	2025
Subjects:	Machine Learning; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2502.08021

_version_	1866918168436408320
author	Liu, Pai; Zhao, Lingfeng; Agarwal, Shivangi; Liu, Jinghan; Huang, Audrey; Amortila, Philip; Jiang, Nan
author_facet	Liu, Pai; Zhao, Lingfeng; Agarwal, Shivangi; Liu, Jinghan; Huang, Audrey; Amortila, Philip; Jiang, Nan
contents	Holdout validation and hyperparameter tuning from data is a long-standing problem in offline reinforcement learning (RL). A standard framework is to use off-policy evaluation (OPE) methods to evaluate and select the policies, but OPE either incurs exponential variance (e.g., importance sampling) or has hyperparameters of its own (e.g., FQE and model-based methods). We focus on hyperparameter tuning for OPE itself, which is even more under-investigated. Concretely, we select among candidate value functions ("model-free") or dynamics models ("model-based") to best assess the performance of a target policy. We develop: (1) new model-free and model-based selectors with theoretical guarantees, and (2) a new experimental protocol for empirically evaluating them. Compared to the model-free protocol in prior works, our new protocol allows for more stable generation and better control of candidate value functions in an optimization-free manner, and evaluation of model-free and model-based methods alike. We exemplify the protocol on Gym-Hopper, and find that our new model-free selector, LSTD-Tournament, demonstrates promising empirical performance.
format	Preprint
id	arxiv_https___arxiv_org_abs_2502_08021
institution	arXiv
publishDate	2025
record_format	arxiv
title	Model Selection for Off-policy Evaluation: New Algorithms and Experimental Protocol
topic	Machine Learning; Artificial Intelligence
url	https://arxiv.org/abs/2502.08021