Staff View: :: Library Catalog

Saved in:

Bibliographic Details
Main Authors:	Agarwal, Mridul, Aggarwal, Vaneet, Quinn, Christopher J., Umrawal, Abhishek
Format:	Preprint
Published:	2020
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2011.07687
Tags:	Add Tag No Tags, Be the first to tag this record!

_version_	1866915795075858432
author	Agarwal, Mridul Aggarwal, Vaneet Quinn, Christopher J. Umrawal, Abhishek
author_facet	Agarwal, Mridul Aggarwal, Vaneet Quinn, Christopher J. Umrawal, Abhishek
contents	We consider the bandit problem of selecting $K$ out of $N$ arms at each time step. The reward can be a non-linear function of the rewards of the selected individual arms. The direct use of a multi-armed bandit algorithm requires choosing among $\binom{N}{K}$ options, making the action space large. To simplify the problem, existing works on combinatorial bandits {typically} assume feedback as a linear function of individual rewards. In this paper, we prove the lower bound for top-$K$ subset selection with bandit feedback with possibly correlated rewards. We present a novel algorithm for the combinatorial setting without using individual arm feedback or requiring linearity of the reward function. Additionally, our algorithm works on correlated rewards of individual arms. Our algorithm, aDaptive Accept RejecT (DART), sequentially finds good arms and eliminates bad arms based on confidence bounds. DART is computationally efficient and uses storage linear in $N$. Further, DART achieves a regret bound of $\tilde{\mathcal{O}}(K\sqrt{KNT})$ for a time horizon $T$, which matches the lower bound in bandit feedback up to a factor of $\sqrt{\log{2NT}}$. When applied to the problem of cross-selling optimization and maximizing the mean of individual rewards, the performance of the proposed algorithm surpasses that of state-of-the-art algorithms. We also show that DART significantly outperforms existing methods for both linear and non-linear joint reward environments.
format	Preprint
id	arxiv_https___arxiv_org_abs_2011_07687
institution	arXiv
publishDate	2020
record_format	arxiv
spellingShingle	DART: aDaptive Accept RejecT for non-linear top-K subset identification Agarwal, Mridul Aggarwal, Vaneet Quinn, Christopher J. Umrawal, Abhishek Machine Learning We consider the bandit problem of selecting $K$ out of $N$ arms at each time step. The reward can be a non-linear function of the rewards of the selected individual arms. The direct use of a multi-armed bandit algorithm requires choosing among $\binom{N}{K}$ options, making the action space large. To simplify the problem, existing works on combinatorial bandits {typically} assume feedback as a linear function of individual rewards. In this paper, we prove the lower bound for top-$K$ subset selection with bandit feedback with possibly correlated rewards. We present a novel algorithm for the combinatorial setting without using individual arm feedback or requiring linearity of the reward function. Additionally, our algorithm works on correlated rewards of individual arms. Our algorithm, aDaptive Accept RejecT (DART), sequentially finds good arms and eliminates bad arms based on confidence bounds. DART is computationally efficient and uses storage linear in $N$. Further, DART achieves a regret bound of $\tilde{\mathcal{O}}(K\sqrt{KNT})$ for a time horizon $T$, which matches the lower bound in bandit feedback up to a factor of $\sqrt{\log{2NT}}$. When applied to the problem of cross-selling optimization and maximizing the mean of individual rewards, the performance of the proposed algorithm surpasses that of state-of-the-art algorithms. We also show that DART significantly outperforms existing methods for both linear and non-linear joint reward environments.
title	DART: aDaptive Accept RejecT for non-linear top-K subset identification
topic	Machine Learning
url	https://arxiv.org/abs/2011.07687

Similar Items