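Staff View: Unmasking Reasoning Processes: A Process-aware Benchmark for Evaluating Structural Mathematical Reasoning in LLMs :: Library Catalog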

Bibliographic Details
Main Authors:	Zheng, Xiang; Zhai, Weiqi; Wang, Wei; Yang, Boyu; Li, Wenbo; Luo, Ruixiang; Sun, Haoxiang; Wang, Yucheng; Li, Zhengze; Wang, Meng; Du, Yuetian; Lin, Guojie; Wang, Yaxuan; Xu, Xiaoxiao; Mo, Yanhu; Ren, Xuan; Wei, Hu; Zhao, Bing
Format:	Preprint
Published:	2026
Subjects:	Artificial Intelligence; Computation and Language
Online Access:	https://arxiv.org/abs/2602.00564

_version_	1866917296525541376
author	Zheng, Xiang; Zhai, Weiqi; Wang, Wei; Yang, Boyu; Li, Wenbo; Luo, Ruixiang; Sun, Haoxiang; Wang, Yucheng; Li, Zhengze; Wang, Meng; Du, Yuetian; Lin, Guojie; Wang, Yaxuan; Xu, Xiaoxiao; Mo, Yanhu; Ren, Xuan; Wei, Hu; Zhao, Bing
author_facet	Zheng, Xiang; Zhai, Weiqi; Wang, Wei; Yang, Boyu; Li, Wenbo; Luo, Ruixiang; Sun, Haoxiang; Wang, Yucheng; Li, Zhengze; Wang, Meng; Du, Yuetian; Lin, Guojie; Wang, Yaxuan; Xu, Xiaoxiao; Mo, Yanhu; Ren, Xuan; Wei, Hu; Zhao, Bing
contents	Recent large language models (LLMs) achieve near-saturation accuracy on many established mathematical reasoning benchmarks, raising concerns about whether these benchmarks can still diagnose genuine reasoning competence. This saturation largely stems from the dominance of template-based computation and shallow arithmetic decomposition in existing datasets, which underrepresent reasoning skills such as multi-constraint coordination, constructive logical synthesis, and spatial inference. To address this gap, we introduce ReasoningMath-Plus, a benchmark of 150 carefully curated problems explicitly designed to evaluate structural reasoning. Each problem emphasizes reasoning under interacting constraints, constructive solution formation, or non-trivial structural insight, and is annotated with a minimal reasoning skeleton to support fine-grained process-level evaluation. Alongside the dataset, we introduce HCRS (Hazard-aware Chain-based Rule Score), a deterministic step-level scoring function, and train a Process Reward Model (PRM) on the annotated reasoning traces. Empirically, while leading models attain relatively high final-answer accuracy (up to 5.8/10), HCRS-based holistic evaluation yields substantially lower scores (average 4.36/10, best 5.14/10), showing that answer-only metrics can overestimate reasoning robustness.
format	Preprint
id	arxiv_https___arxiv_org_abs_2602_00564
institution	arXiv
publishDate	2026
record_format	arxiv
title	Unmasking Reasoning Processes: A Process-aware Benchmark for Evaluating Structural Mathematical Reasoning in LLMs
topic	Artificial Intelligence; Computation and Language
url	https://arxiv.org/abs/2602.00564

Similar Items