Staff View: Off-Policy Evaluation of Slate Bandit Policies via Optimizing Abstraction :: Library Catalog

Bibliographic Details
Main Authors:	Kiyohara, Haruka; Nomura, Masahiro; Saito, Yuta
Format:	Preprint
Published:	2024
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2402.02171

_version_	1866909110590504960
author	Kiyohara, Haruka; Nomura, Masahiro; Saito, Yuta
author_facet	Kiyohara, Haruka; Nomura, Masahiro; Saito, Yuta
contents	We study off-policy evaluation (OPE) in slate contextual bandits, where a policy selects multi-dimensional actions known as slates. This problem is widespread in applications ranging from recommender systems, search engines, and marketing to medicine. However, the typical Inverse Propensity Scoring (IPS) estimator suffers from substantial variance due to large action spaces, making effective OPE a significant challenge. The PseudoInverse (PI) estimator has been introduced to mitigate the variance issue by assuming linearity in the reward function, but this can result in significant bias because the assumption is hard to verify from observed data and is often substantially violated. To address the limitations of previous estimators, we develop a novel estimator for OPE of slate bandits, called Latent IPS (LIPS), which defines importance weights in a low-dimensional slate abstraction space, where we optimize slate abstractions to minimize the bias and variance of LIPS in a data-driven way. By doing so, LIPS can substantially reduce the variance of IPS without imposing restrictive assumptions, such as linearity, on the structure of the reward function. Through empirical evaluation, we demonstrate that LIPS substantially outperforms existing estimators, particularly in scenarios with non-linear rewards and large slate spaces.
format	Preprint
id	arxiv_https___arxiv_org_abs_2402_02171
institution	arXiv
publishDate	2024
record_format	arxiv
title	Off-Policy Evaluation of Slate Bandit Policies via Optimizing Abstraction
topic	Machine Learning
url	https://arxiv.org/abs/2402.02171

Similar Items