Bibliographic Details
Main Authors: Rietz, Finn, Smirnov, Oleg, Karimi, Sara, Cao, Lele
Format: Preprint
Published: 2025
Subjects: Machine Learning
Online Access: https://arxiv.org/abs/2502.06358
author Rietz, Finn
Smirnov, Oleg
Karimi, Sara
Cao, Lele
contents Prompting has emerged as the dominant paradigm for adapting large, pre-trained transformer-based models to downstream tasks. The Prompting Decision Transformer (PDT) enables large-scale, multi-task offline Reinforcement Learning (RL) pre-training by leveraging stochastic trajectory prompts to identify the target task. However, these prompts are sampled uniformly from expert demonstrations, overlooking a critical limitation: not all prompts are equally informative for differentiating between tasks. This limits generalization and adaptation, especially in low-data or open-world settings where sample efficiency is crucial. To address this issue, we propose a lightweight, inference-time, bandit-based prompt-tuning framework. The bandit explores and optimizes trajectory prompt selection to enhance task performance, while avoiding costly fine-tuning of the transformer backbone. Our experiments indicate not only clear performance gains due to bandit-based prompt-tuning, but also better sample complexity, scalability, and prompt space exploration compared to prompt-tuning baselines. These results highlight the importance of adaptive prompt selection mechanisms for efficient generalization in offline multi-task RL.
format Preprint
id arxiv_https___arxiv_org_abs_2502_06358
institution arXiv
publishDate 2025
record_format arxiv
title Prompt-Tuning Bandits: Enabling Few-Shot Generalization for Efficient Multi-Task Offline RL
topic Machine Learning
url https://arxiv.org/abs/2502.06358
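
Illustrative note: the abstract describes a bandit that explores and optimizes trajectory prompt selection at inference time, without fine-tuning the transformer backbone. The sketch below shows one way such a loop could look, using the standard UCB1 rule over a fixed pool of candidate prompts. It is an assumption for illustration only, not the authors' method: the record does not say which bandit algorithm the paper uses, and the names `ucb1_select_prompt`, `make_toy_evaluator`, and `evaluate` are hypothetical. In a real setting, `evaluate` would roll out the frozen model conditioned on the chosen prompt and return the episode return; here it is replaced by a noisy toy function.

```python
import math
import random


def ucb1_select_prompt(evaluate, num_prompts, rounds, c=2.0):
    """Pick a prompt index with UCB1 (assumed algorithm, not from the paper).

    evaluate(arm) is a hypothetical callback returning a scalar reward for
    using candidate prompt `arm`. The model itself is never updated.
    """
    counts = [0] * num_prompts   # how often each prompt was tried
    means = [0.0] * num_prompts  # running mean reward per prompt
    for t in range(1, rounds + 1):
        if t <= num_prompts:
            arm = t - 1  # try every prompt once before using the UCB rule
        else:
            arm = max(
                range(num_prompts),
                key=lambda i: means[i] + math.sqrt(c * math.log(t) / counts[i]),
            )
        reward = evaluate(arm)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # incremental mean
    best = max(range(num_prompts), key=lambda i: means[i])
    return best, means, counts


def make_toy_evaluator(true_means, noise=0.1, seed=0):
    """Toy stand-in for a model rollout: hidden mean reward plus Gaussian noise."""
    rng = random.Random(seed)

    def evaluate(arm):
        return true_means[arm] + rng.gauss(0.0, noise)

    return evaluate


if __name__ == "__main__":
    evaluate = make_toy_evaluator([0.2, 0.5, 0.9, 0.4])
    best, means, counts = ucb1_select_prompt(evaluate, num_prompts=4, rounds=1000)
    print("selected prompt:", best)
    print("pulls per prompt:", counts)
```

The point of the sketch is the contrast with uniform sampling mentioned in the abstract: instead of drawing prompts at random, the loop spends most evaluations on prompts whose observed reward is high while still occasionally revisiting the others.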