Staff View: Enhancing Multi-Step Reasoning Abilities of Language Models through Direct Q-Function Optimization

Bibliographic Details
Main Authors:	Ji, Kaixuan; Liu, Guanlin; Dai, Ning; Yang, Qingping; Zheng, Renjie; Wu, Zheng; Dun, Chen; Gu, Quanquan; Yan, Lin
Format:	Preprint
Published:	2024
Subjects:	Machine Learning; Artificial Intelligence; Computation and Language
Online Access:	https://arxiv.org/abs/2410.09302

_version_	1866915145126510592
author	Ji, Kaixuan; Liu, Guanlin; Dai, Ning; Yang, Qingping; Zheng, Renjie; Wu, Zheng; Dun, Chen; Gu, Quanquan; Yan, Lin
author_facet	Ji, Kaixuan; Liu, Guanlin; Dai, Ning; Yang, Qingping; Zheng, Renjie; Wu, Zheng; Dun, Chen; Gu, Quanquan; Yan, Lin
contents	Reinforcement Learning (RL) plays a crucial role in aligning large language models (LLMs) with human preferences and improving their ability to perform complex tasks. However, current approaches either require significant computational resources due to the use of multiple models and extensive online sampling for training (e.g., PPO) or are framed as bandit problems (e.g., DPO, DRO), which often struggle with multi-step reasoning tasks, such as math problem solving and complex reasoning that involve long chains of thought. To overcome these limitations, we introduce Direct Q-function Optimization (DQO), which formulates the response generation process as a Markov Decision Process (MDP) and utilizes the soft actor-critic (SAC) framework to optimize a Q-function directly parameterized by the language model. The MDP formulation of DQO offers structural advantages over bandit-based methods, enabling more effective process supervision. Experimental results on two math problem-solving datasets, GSM8K and MATH, demonstrate that DQO outperforms previous methods, establishing it as a promising offline reinforcement learning approach for aligning language models.
format	Preprint
id	arxiv_https___arxiv_org_abs_2410_09302
institution	arXiv
publishDate	2024
record_format	arxiv
title	Enhancing Multi-Step Reasoning Abilities of Language Models through Direct Q-Function Optimization
topic	Machine Learning; Artificial Intelligence; Computation and Language
url	https://arxiv.org/abs/2410.09302
