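Staff View: Pretraining Decision Transformers with Reward Prediction for In-Context Multi-task Structured Bandit Learning :: Library Catalog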

Bibliographic Details
Main Authors:	Mukherjee, Subhojyoti; Hanna, Josiah P.; Xie, Qiaomin; Nowak, Robert
Format:	Preprint
Published:	2024
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2406.05064

_version_	1866908606369103872
author	Mukherjee, Subhojyoti; Hanna, Josiah P.; Xie, Qiaomin; Nowak, Robert
author_facet	Mukherjee, Subhojyoti; Hanna, Josiah P.; Xie, Qiaomin; Nowak, Robert
contents	We study learning to learn for the multi-task structured bandit problem where the goal is to learn a near-optimal algorithm that minimizes cumulative regret. The tasks share a common structure and an algorithm should exploit the shared structure to minimize the cumulative regret for an unseen but related test task. We use a transformer as a decision-making algorithm to learn this shared structure from data collected by a demonstrator on a set of training task instances. Our objective is to devise a training procedure such that the transformer will learn to outperform the demonstrator's learning algorithm on unseen test task instances. Prior work on pretraining decision transformers either requires privileged information like access to optimal arms or cannot outperform the demonstrator. Going beyond these approaches, we introduce a pre-training approach that trains a transformer network to learn a near-optimal policy in-context. This approach leverages the shared structure across tasks, does not require access to optimal actions, and can outperform the demonstrator. We validate these claims over a wide variety of structured bandit problems to show that our proposed solution is general and can quickly identify expected rewards on unseen test tasks to support effective exploration.
format	Preprint
id	arxiv_https___arxiv_org_abs_2406_05064
institution	arXiv
publishDate	2024
record_format	arxiv
title	Pretraining Decision Transformers with Reward Prediction for In-Context Multi-task Structured Bandit Learning
topic	Machine Learning
url	https://arxiv.org/abs/2406.05064

Similar Items