Bibliographic Details
Main Authors: Li, Xuzhao, Li, Xuchen, Zhao, Jian, Hu, Shiyu
Format: Preprint
Published: 2026
Subjects: Computation and Language, Artificial Intelligence
Online Access: https://arxiv.org/abs/2602.02497
_version_ 1866910009603915776
author Li, Xuzhao
Li, Xuchen
Zhao, Jian
Hu, Shiyu
contents As Large Language Models (LLMs) achieve significant breakthroughs in complex reasoning tasks, evaluating their proficiency in science, technology, engineering, and mathematics (STEM) has become a primary method for measuring machine intelligence. However, current evaluation paradigms often treat benchmarks as isolated "silos," offering only monolithic aggregate scores that neglect the intricacies of both academic specialization and cognitive depth. This result-oriented approach fails to distinguish whether model errors stem from insufficient domain knowledge or deficiencies in cognitive capacity, thereby limiting the diagnostic value. To address this, we propose STEMVerse, a diagnostic framework designed to systematically analyze the STEM reasoning capabilities of LLMs. This framework characterizes model performance across academic specialization and cognitive complexity to map the capability required for reasoning. We re-aggregate over 20,000 STEM problems from mainstream benchmarks into a unified "Discipline $\times$ Cognition" capability space, assigning dual-axis labels to every instance. Utilizing this unified diagnostic framework, we systematically evaluate representative LLM families across varying parameter scales and training paradigms. Our empirical results reveal structural failure patterns in STEM reasoning. By integrating multi-disciplinary coverage and fine-grained cognitive stratification into a unified framework, STEMVerse provides a clear and actionable perspective for understanding the scientific reasoning characteristics of LLMs.
format Preprint
id arxiv_https___arxiv_org_abs_2602_02497
institution arXiv
publishDate 2026
record_format arxiv
title STEMVerse: A Dual-Axis Diagnostic Framework for STEM Reasoning in Large Language Models
topic Computation and Language
Artificial Intelligence
url https://arxiv.org/abs/2602.02497