Staff View: :: Library Catalog

Saved in:

Bibliographic Details
Main Authors:	Zhou, Runlong, Du, Simon S., Li, Beibin
Format:	Preprint
Published:	2024
Subjects:	Machine Learning Computation and Language
Online Access:	https://arxiv.org/abs/2402.12621
Tags:	Add Tag No Tags, Be the first to tag this record!

_version_	1866911907320954880
author	Zhou, Runlong Du, Simon S. Li, Beibin
author_facet	Zhou, Runlong Du, Simon S. Li, Beibin
contents	As language models (LMs) demonstrate their capabilities in various fields, their application to tasks requiring multi-round interactions has become increasingly popular. These tasks usually have complex dynamics, so supervised fine-tuning (SFT) on a limited offline dataset does not yield good performance. However, only a few works attempted to directly train the LMs within interactive decision-making environments. We aim to create an effective approach to fine-tune LMs with online reinforcement learning (RL) in these environments. We propose Reflect-RL, a two-player system to fine-tune an LM using SFT and online RL, where a frozen reflection model (player) assists the policy model (player). To generate data for the warm-up SFT stage, we use negative example generation to enhance the error-correction ability of the reflection model. Furthermore, we designed single-prompt action enumeration and applied curriculum learning to allow the policy model to learn more efficiently. Empirically, we verify that Reflect-RL outperforms SFT and online RL without reflection. Testing results indicate GPT-2 XL 1.56B fine-tuned with Reflect-RL outperforms larger open-source LMs, such as Mistral 7B. The benchmarks, dataset, and code involved in this work are publicly available: https://github.com/zhourunlong/Reflect-RL.
format	Preprint
id	arxiv_https___arxiv_org_abs_2402_12621
institution	arXiv
publishDate	2024
record_format	arxiv
spellingShingle	Reflect-RL: Two-Player Online RL Fine-Tuning for LMs Zhou, Runlong Du, Simon S. Li, Beibin Machine Learning Computation and Language As language models (LMs) demonstrate their capabilities in various fields, their application to tasks requiring multi-round interactions has become increasingly popular. These tasks usually have complex dynamics, so supervised fine-tuning (SFT) on a limited offline dataset does not yield good performance. However, only a few works attempted to directly train the LMs within interactive decision-making environments. We aim to create an effective approach to fine-tune LMs with online reinforcement learning (RL) in these environments. We propose Reflect-RL, a two-player system to fine-tune an LM using SFT and online RL, where a frozen reflection model (player) assists the policy model (player). To generate data for the warm-up SFT stage, we use negative example generation to enhance the error-correction ability of the reflection model. Furthermore, we designed single-prompt action enumeration and applied curriculum learning to allow the policy model to learn more efficiently. Empirically, we verify that Reflect-RL outperforms SFT and online RL without reflection. Testing results indicate GPT-2 XL 1.56B fine-tuned with Reflect-RL outperforms larger open-source LMs, such as Mistral 7B. The benchmarks, dataset, and code involved in this work are publicly available: https://github.com/zhourunlong/Reflect-RL.
title	Reflect-RL: Two-Player Online RL Fine-Tuning for LMs
topic	Machine Learning Computation and Language
url	https://arxiv.org/abs/2402.12621

Similar Items