Bibliographic Details
Main Authors: Ji, Kaixuan; Liu, Guanlin; Dai, Ning; Yang, Qingping; Zheng, Renjie; Wu, Zheng; Dun, Chen; Gu, Quanquan; Yan, Lin
Format: Preprint
Published: 2024
Subjects: Machine Learning; Artificial Intelligence; Computation and Language
Online Access: https://arxiv.org/abs/2410.09302
_version_ 1866915145126510592
author Ji, Kaixuan
Liu, Guanlin
Dai, Ning
Yang, Qingping
Zheng, Renjie
Wu, Zheng
Dun, Chen
Gu, Quanquan
Yan, Lin
contents Reinforcement Learning (RL) plays a crucial role in aligning large language models (LLMs) with human preferences and improving their ability to perform complex tasks. However, current approaches either require significant computational resources due to the use of multiple models and extensive online sampling for training (e.g., PPO) or are framed as bandit problems (e.g., DPO, DRO), which often struggle with multi-step reasoning tasks, such as math problem solving and complex reasoning that involve long chains of thought. To overcome these limitations, we introduce Direct Q-function Optimization (DQO), which formulates the response generation process as a Markov Decision Process (MDP) and utilizes the soft actor-critic (SAC) framework to optimize a Q-function directly parameterized by the language model. The MDP formulation of DQO offers structural advantages over bandit-based methods, enabling more effective process supervision. Experimental results on two math problem-solving datasets, GSM8K and MATH, demonstrate that DQO outperforms previous methods, establishing it as a promising offline reinforcement learning approach for aligning language models.
format Preprint
id arxiv_https___arxiv_org_abs_2410_09302
institution arXiv
publishDate 2024
record_format arxiv
title Enhancing Multi-Step Reasoning Abilities of Language Models through Direct Q-Function Optimization
topic Machine Learning
Artificial Intelligence
Computation and Language
url https://arxiv.org/abs/2410.09302
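The abstract in this record describes optimizing a Q-function within the soft actor-critic (SAC) framework. As a rough illustration only, the sketch below shows the standard maximum-entropy relationship between per-action Q-values, the soft state value, and the induced policy at a single state. It is not the DQO objective from the paper; the function names `soft_value` and `soft_policy`, the temperature `beta`, and the example Q-values are all hypothetical.

```python
import math


def soft_value(q_values, beta):
    """Soft state value V(s) = beta * log(sum_a exp(Q(s, a) / beta)).

    Subtracting the maximum Q-value first keeps the exponentials
    numerically stable without changing the result.
    """
    m = max(q_values)
    return m + beta * math.log(sum(math.exp((q - m) / beta) for q in q_values))


def soft_policy(q_values, beta):
    """Maximum-entropy policy pi(a | s) = exp((Q(s, a) - V(s)) / beta)."""
    v = soft_value(q_values, beta)
    return [math.exp((q - v) / beta) for q in q_values]


if __name__ == "__main__":
    # Hypothetical Q-values for three candidate next tokens at one state.
    q = [1.0, 2.0, 0.5]
    for beta in (1.0, 0.1):
        probs = soft_policy(q, beta)
        print(f"beta={beta}: V={soft_value(q, beta):.4f}, "
              f"pi={[round(p, 4) for p in probs]}")
```

A smaller `beta` concentrates the policy on the highest-valued action and pulls the soft value toward the maximum Q-value, while a larger `beta` spreads probability more evenly across actions.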