Bibliographic Details
Main Authors: Cao, Lang; Chen, Renhong; Zou, Yingtian; Peng, Chao; Xu, Huacong; Wang, Yuxian; Ning, Wu; Chen, Qian; Peng, Mofan; Chen, Zijie; Su, Peishuo; Li, Yitong
Format: Preprint
Published: 2025
Subjects: Machine Learning; Artificial Intelligence; Computation and Language
Online Access: https://arxiv.org/abs/2503.22233
author Cao, Lang
Chen, Renhong
Zou, Yingtian
Peng, Chao
Xu, Huacong
Wang, Yuxian
Ning, Wu
Chen, Qian
Peng, Mofan
Chen, Zijie
Su, Peishuo
Li, Yitong
contents We introduce the Entropy-Driven Uncertainty Process Reward Model (EDU-PRM), an entropy-driven training framework for process reward modeling that segments complex reasoning into steps dynamically, in line with model uncertainty, eliminating the need for costly manual step annotations. Unlike previous Process Reward Models (PRMs), which rely on static partitioning and human labeling, EDU-PRM automatically anchors step boundaries at tokens with high predictive entropy, capturing intrinsic logical transitions and enabling efficient exploration of diverse reasoning paths. On the ProcessBench benchmark, EDU-PRM outperforms strong public PRM baselines such as Math-Shepherd PRM and Omega PRM, and achieves results comparable to SOTA models while using only 1.5% of the training data. Furthermore, with our proposed EDU sampling strategy, accuracy on generative reasoning tasks rises from 64.7% to 67.3%, accompanied by a 32% reduction in token usage. These findings underscore the potential of EDU-PRM as a scalable, annotation-efficient paradigm for process supervision in mathematical reasoning, paving the way for more efficient and robust approaches to complex mathematical problem solving.
format Preprint
id arxiv_https___arxiv_org_abs_2503_22233
institution arXiv
publishDate 2025
record_format arxiv
title More Bang for the Buck: Process Reward Modeling with Entropy-Driven Uncertainty
topic Machine Learning
Artificial Intelligence
Computation and Language
url https://arxiv.org/abs/2503.22233
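
The abstract says that EDU-PRM anchors step boundaries at tokens with high predictive entropy. The record does not give the procedure itself, so the following is only a minimal sketch of that general idea, not the authors' implementation. The function names, the fixed threshold of 1.0 nats, and the rule that a high-entropy token opens a new step are all assumptions made for illustration.

```python
import math
from typing import List, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def segment_by_entropy(
    tokens: Sequence[str],
    dists: Sequence[Sequence[float]],
    threshold: float = 1.0,  # assumed cutoff, not taken from the paper
) -> List[List[str]]:
    """Split tokens into steps, opening a new step at each high-entropy token.

    dists[i] is the (hypothetical) distribution the model predicted when it
    generated tokens[i].
    """
    if len(tokens) != len(dists):
        raise ValueError("tokens and dists must have the same length")
    steps: List[List[str]] = []
    current: List[str] = []
    for token, probs in zip(tokens, dists):
        # Assumed rule: a token produced under high uncertainty starts a new step.
        if current and token_entropy(probs) > threshold:
            steps.append(current)
            current = []
        current.append(token)
    if current:
        steps.append(current)
    return steps


# Toy data: the model is confident everywhere except at the token "so".
confident = [0.97, 0.01, 0.01, 0.01]
uncertain = [0.25, 0.25, 0.25, 0.25]
tokens = ["x", "=", "2", "so", "y", "=", "4"]
dists = [confident] * 3 + [uncertain] + [confident] * 3

print(segment_by_entropy(tokens, dists))
# prints [['x', '=', '2'], ['so', 'y', '=', '4']]
```

In this toy run the uniform distribution has entropy ln 4, about 1.386 nats, which exceeds the assumed threshold, so "so" opens the second step. A real system would read the distributions from the language model's logits and would likely tune or adapt the threshold.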