Bibliographic Details
Main Authors: Zhou, Runlong; Du, Simon S.; Li, Beibin
Format: Preprint
Published: 2024
Subjects: Machine Learning; Computation and Language
Online Access: https://arxiv.org/abs/2402.12621
author Zhou, Runlong
Du, Simon S.
Li, Beibin
contents As language models (LMs) demonstrate their capabilities in various fields, their application to tasks requiring multi-round interactions has become increasingly popular. These tasks usually have complex dynamics, so supervised fine-tuning (SFT) on a limited offline dataset does not yield good performance. However, only a few works have attempted to train LMs directly within interactive decision-making environments. We aim to create an effective approach to fine-tuning LMs with online reinforcement learning (RL) in these environments. We propose Reflect-RL, a two-player system to fine-tune an LM using SFT and online RL, where a frozen reflection model (player) assists the policy model (player). To generate data for the warm-up SFT stage, we use negative example generation to enhance the error-correction ability of the reflection model. Furthermore, we design single-prompt action enumeration and apply curriculum learning to allow the policy model to learn more efficiently. Empirically, we verify that Reflect-RL outperforms SFT and online RL without reflection. Test results indicate that GPT-2 XL 1.56B fine-tuned with Reflect-RL outperforms larger open-source LMs, such as Mistral 7B. The benchmarks, dataset, and code involved in this work are publicly available: https://github.com/zhourunlong/Reflect-RL.
format Preprint
id arxiv_https___arxiv_org_abs_2402_12621
institution arXiv
publishDate 2024
record_format arxiv
title Reflect-RL: Two-Player Online RL Fine-Tuning for LMs
topic Machine Learning
Computation and Language
url https://arxiv.org/abs/2402.12621