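| Main Authors: | Zheng, Xiang; Zhai, Weiqi; Wang, Wei; Yang, Boyu; Li, Wenbo; Luo, Ruixiang; Sun, Haoxiang; Wang, Yucheng; Li, Zhengze; Wang, Meng; Du, Yuetian; Lin, Guojie; Wang, Yaxuan; Xu, Xiaoxiao; Mo, Yanhu; Ren, Xuan; Wei, Hu; Zhao, Bing |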
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | Artificial Intelligence; Computation and Language |
| Online Access: | https://arxiv.org/abs/2602.00564 |
| _version_ | 1866917296525541376 |
|---|---|
| author | Zheng, Xiang; Zhai, Weiqi; Wang, Wei; Yang, Boyu; Li, Wenbo; Luo, Ruixiang; Sun, Haoxiang; Wang, Yucheng; Li, Zhengze; Wang, Meng; Du, Yuetian; Lin, Guojie; Wang, Yaxuan; Xu, Xiaoxiao; Mo, Yanhu; Ren, Xuan; Wei, Hu; Zhao, Bing |
| contents | Recent large language models (LLMs) achieve near-saturation accuracy on many established mathematical reasoning benchmarks, raising concerns about their ability to diagnose genuine reasoning competence. This saturation largely stems from the dominance of template-based computation and shallow arithmetic decomposition in existing datasets, which underrepresent reasoning skills such as multi-constraint coordination, constructive logical synthesis, and spatial inference. To address this gap, we introduce ReasoningMath-Plus, a benchmark of 150 carefully curated problems explicitly designed to evaluate structural reasoning. Each problem emphasizes reasoning under interacting constraints, constructive solution formation, or non-trivial structural insight, and is annotated with a minimal reasoning skeleton to support fine-grained process-level evaluation. Alongside the dataset, we introduce HCRS (Hazard-aware Chain-based Rule Score), a deterministic step-level scoring function, and train a Process Reward Model (PRM) on the annotated reasoning traces. Empirically, while leading models attain relatively high final-answer accuracy (up to 5.8/10), HCRS-based holistic evaluation yields substantially lower scores (average 4.36/10, best 5.14/10), showing that answer-only metrics can overestimate reasoning robustness. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2602_00564 |
| institution | arXiv |
| publishDate | 2026 |
| record_format | arxiv |
| title | Unmasking Reasoning Processes: A Process-aware Benchmark for Evaluating Structural Mathematical Reasoning in LLMs |
| topic | Artificial Intelligence; Computation and Language |
| url | https://arxiv.org/abs/2602.00564 |