Bibliographic Details
Main Authors: Hou, Yujie, Wang, Mei, Zhong, Yaoyao, Zhang, Ting, Ma, Xuetao, Huang, Hua
Format: Preprint
Published: 2025
Subjects: Artificial Intelligence
Online Access: https://arxiv.org/abs/2505.16646
author Hou, Yujie
Wang, Mei
Zhong, Yaoyao
Zhang, Ting
Ma, Xuetao
Huang, Hua
contents Large Language Models (LLMs) have achieved remarkable performance across a wide range of mathematical benchmarks. However, concerns remain as to whether these successes reflect genuine reasoning or superficial pattern recognition. Existing evaluation methods, which typically focus either on the final answer or on the intermediate reasoning steps, reduce mathematical reasoning to a shallow input-output mapping, overlooking its inherently multi-stage and multi-dimensional cognitive nature. Inspired by Polya's problem-solving theory, we propose SMART, a benchmark that decomposes mathematical problem-solving into four cognitive dimensions: Semantic Understanding, Mathematical Reasoning, Arithmetic Computation, and Reflection & Refinement, and introduces dimension-specific tasks to measure the corresponding cognitive processes of LLMs. We apply SMART to 22 state-of-the-art open- and closed-source LLMs and uncover substantial discrepancies in their capabilities across dimensions. Our findings reveal genuine weaknesses in current models and motivate a new metric, the All-Pass Score, designed to better capture true problem-solving capability.
format Preprint
id arxiv_https___arxiv_org_abs_2505_16646
institution arXiv
publishDate 2025
record_format arxiv
title SMART: Self-Generating and Self-Validating Multi-Dimensional Assessment for LLMs' Mathematical Problem Solving
topic Artificial Intelligence
url https://arxiv.org/abs/2505.16646