Bibliographic Details
Main Authors: Zheng, Xiang; Zhai, Weiqi; Wang, Wei; Yang, Boyu; Li, Wenbo; Luo, Ruixiang; Sun, Haoxiang; Wang, Yucheng; Li, Zhengze; Wang, Meng; Du, Yuetian; Lin, Guojie; Wang, Yaxuan; Xu, Xiaoxiao; Mo, Yanhu; Ren, Xuan; Wei, Hu; Zhao, Bing
Format: Preprint
Published: 2026
Subjects: Artificial Intelligence; Computation and Language
Online Access: https://arxiv.org/abs/2602.00564
author Zheng, Xiang
Zhai, Weiqi
Wang, Wei
Yang, Boyu
Li, Wenbo
Luo, Ruixiang
Sun, Haoxiang
Wang, Yucheng
Li, Zhengze
Wang, Meng
Du, Yuetian
Lin, Guojie
Wang, Yaxuan
Xu, Xiaoxiao
Mo, Yanhu
Ren, Xuan
Wei, Hu
Zhao, Bing
contents Recent large language models (LLMs) achieve near-saturation accuracy on many established mathematical reasoning benchmarks, raising concerns about whether these benchmarks can still diagnose genuine reasoning competence. This saturation largely stems from the dominance of template-based computation and shallow arithmetic decomposition in existing datasets, which underrepresent reasoning skills such as multi-constraint coordination, constructive logical synthesis, and spatial inference. To address this gap, we introduce ReasoningMath-Plus, a benchmark of 150 carefully curated problems explicitly designed to evaluate structural reasoning. Each problem emphasizes reasoning under interacting constraints, constructive solution formation, or non-trivial structural insight, and is annotated with a minimal reasoning skeleton to support fine-grained process-level evaluation. Alongside the dataset, we introduce HCRS (Hazard-aware Chain-based Rule Score), a deterministic step-level scoring function, and train a Process Reward Model (PRM) on the annotated reasoning traces. Empirically, while leading models attain relatively high final-answer accuracy (up to 5.8/10), HCRS-based holistic evaluation yields substantially lower scores (average 4.36/10, best 5.14/10), showing that answer-only metrics can overestimate reasoning robustness.
format Preprint
id arxiv_https___arxiv_org_abs_2602_00564
institution arXiv
publishDate 2026
record_format arxiv
title Unmasking Reasoning Processes: A Process-aware Benchmark for Evaluating Structural Mathematical Reasoning in LLMs
topic Artificial Intelligence
Computation and Language
url https://arxiv.org/abs/2602.00564