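Staff View: SMART: Self-Generating and Self-Validating Multi-Dimensional Assessment for LLMs' Mathematical Problem Solving :: Library Catalog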

Bibliographic Details
Main Authors:	Hou, Yujie; Wang, Mei; Zhong, Yaoyao; Zhang, Ting; Ma, Xuetao; Huang, Hua
Format:	Preprint
Published:	2025
Subjects:	Artificial Intelligence
Online Access:	https://arxiv.org/abs/2505.16646

_version_	1866913046652256256
author	Hou, Yujie; Wang, Mei; Zhong, Yaoyao; Zhang, Ting; Ma, Xuetao; Huang, Hua
author_facet	Hou, Yujie; Wang, Mei; Zhong, Yaoyao; Zhang, Ting; Ma, Xuetao; Huang, Hua
contents	Large Language Models (LLMs) have achieved remarkable performance across a wide range of mathematical benchmarks. However, concerns remain as to whether these successes reflect genuine reasoning or superficial pattern recognition. Existing evaluation methods, which typically focus either on the final answer or on the intermediate reasoning steps, reduce mathematical reasoning to a shallow input-output mapping, overlooking its inherently multi-stage and multi-dimensional cognitive nature. Inspired by Polya's problem-solving theory, we propose SMART, a benchmark that decomposes mathematical problem-solving into four cognitive dimensions: Semantic Understanding, Mathematical Reasoning, Arithmetic Computation, and Reflection & Refinement, and introduces dimension-specific tasks to measure the corresponding cognitive processes of LLMs. We apply SMART to 22 state-of-the-art open- and closed-source LLMs and uncover substantial discrepancies in their capabilities across dimensions. Our findings reveal genuine weaknesses in current models and motivate a new metric, the All-Pass Score, designed to better capture true problem-solving capability.
format	Preprint
id	arxiv_https___arxiv_org_abs_2505_16646
institution	arXiv
publishDate	2025
record_format	arxiv
title	SMART: Self-Generating and Self-Validating Multi-Dimensional Assessment for LLMs' Mathematical Problem Solving
topic	Artificial Intelligence
url	https://arxiv.org/abs/2505.16646
