Staff View: Demystifying Multilingual Chain-of-Thought in Process Reward Modeling :: Library Catalog

Bibliographic Details
Main Authors:	Wang, Weixuan; Wu, Minghao; Haddow, Barry; Birch, Alexandra
Format:	Preprint
Published:	2025
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2502.12663

_version_	1866908559683354624
author	Wang, Weixuan; Wu, Minghao; Haddow, Barry; Birch, Alexandra
author_facet	Wang, Weixuan; Wu, Minghao; Haddow, Barry; Birch, Alexandra
contents	Large language models (LLMs) are designed to perform a wide range of tasks. To improve their ability to solve complex problems requiring multi-step reasoning, recent research leverages process reward modeling to provide fine-grained feedback at each step of the reasoning process for reinforcement learning (RL), but it predominantly focuses on English. In this paper, we tackle the critical challenge of extending process reward models (PRMs) to multilingual settings. To achieve this, we train multilingual PRMs on a dataset spanning seven languages, which is translated from English. Through comprehensive evaluations on two widely used reasoning benchmarks across 11 languages, we demonstrate that multilingual PRMs not only improve average accuracy but also reduce early-stage reasoning errors. Furthermore, our results highlight the sensitivity of multilingual PRMs to both the number of training languages and the volume of English data, while also uncovering the benefits arising from more candidate responses and trainable parameters. This work opens promising avenues for robust multilingual applications in complex, multi-step reasoning tasks. In addition, we release the code to foster research along this line.
format	Preprint
id	arxiv_https___arxiv_org_abs_2502_12663
institution	arXiv
publishDate	2025
record_format	arxiv
title	Demystifying Multilingual Chain-of-Thought in Process Reward Modeling
topic	Computation and Language
url	https://arxiv.org/abs/2502.12663
