Bibliographic Details
Main Authors: Jin, Jing; Liu, Hao; Bai, Yan; Lou, Yihang; Wang, Zhenke; Yuan, Tianrun; Chen, Juntong; Zhu, Yongkang; Zeng, Fanhu; Zhu, Xuanyu; Feng, Tao; Xu, Yige
Format: Preprint
Published: 2026
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2604.19697
_version_ 1866917472985153536
author Jin, Jing
Liu, Hao
Bai, Yan
Lou, Yihang
Wang, Zhenke
Yuan, Tianrun
Chen, Juntong
Zhu, Yongkang
Zeng, Fanhu
Zhu, Xuanyu
Feng, Tao
Xu, Yige
contents Multimodal large language models (MLLMs) have shown promising reasoning abilities, yet evaluating their performance in specialized domains remains challenging. STEM reasoning is a particularly valuable testbed because it provides highly verifiable feedback, but existing benchmarks often permit unimodal shortcuts due to modality redundancy and focus mainly on final-answer accuracy, overlooking the reasoning process itself. To address this challenge, we introduce StepSTEM: a graduate-level benchmark of 283 problems across mathematics, physics, chemistry, biology, and engineering for fine-grained evaluation of cross-modal reasoning in MLLMs. StepSTEM is constructed through a rigorous curation pipeline that enforces strict complementarity between textual and visual inputs. We further propose a general step-level evaluation framework for both text-only chain-of-thought and interleaved image-text reasoning, using dynamic programming to align predicted reasoning steps with multiple reference solutions. Experiments across a wide range of models show that current MLLMs still rely heavily on textual reasoning, with even Gemini 3.1 Pro and Claude Opus 4.6 achieving only 38.29% accuracy. These results highlight substantial headroom for genuine cross-modal STEM reasoning and position StepSTEM as a benchmark for fine-grained evaluation of multimodal reasoning. Source code is available at https://github.com/lll-hhh/STEPSTEM.
format Preprint
id arxiv_https___arxiv_org_abs_2604_19697
institution arXiv
publishDate 2026
record_format arxiv
title Unveiling Fine-Grained Visual Traces: Evaluating Multimodal Interleaved Reasoning Chains in Multimodal STEM Tasks
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2604.19697
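
The abstract mentions using dynamic programming to align predicted reasoning steps with multiple reference solutions, but this record does not describe the algorithm. The following is only a minimal sketch of one plausible reading, not the authors' implementation: an order-preserving (LCS-style) alignment, scored per reference and maximized over references. The function names, the token-overlap similarity, and the 0.5 threshold are all assumptions introduced for illustration.

```python
from typing import Callable, List, Sequence


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity (assumed stand-in for the real step matcher)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def align_score(
    pred: Sequence[str],
    ref: Sequence[str],
    sim: Callable[[str, str], float] = jaccard,
    threshold: float = 0.5,  # hypothetical cutoff for "these two steps match"
) -> float:
    """Fraction of reference steps matched, in order, by predicted steps."""
    n, m = len(pred), len(ref)
    if m == 0:
        return 0.0
    # dp[i][j] = max number of matched step pairs using pred[:i] and ref[:j]
    dp: List[List[int]] = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = max(dp[i - 1][j], dp[i][j - 1])  # skip a step on either side
            if sim(pred[i - 1], ref[j - 1]) >= threshold:
                best = max(best, dp[i - 1][j - 1] + 1)  # align the two steps
            dp[i][j] = best
    return dp[n][m] / m


def best_reference_score(
    pred: Sequence[str],
    refs: Sequence[Sequence[str]],
    sim: Callable[[str, str], float] = jaccard,
    threshold: float = 0.5,
) -> float:
    """Score against every reference solution and keep the best alignment."""
    return max((align_score(pred, r, sim, threshold) for r in refs), default=0.0)


if __name__ == "__main__":
    pred = [
        "compute the force from the diagram",
        "apply newton second law",
        "solve for acceleration",
    ]
    refs = [
        [
            "read the force from the diagram",
            "apply newton second law",
            "solve for acceleration",
            "check units",
        ],
        ["apply newton second law", "solve for acceleration"],
    ]
    print(best_reference_score(pred, refs))
```

Taking the maximum over references means a prediction is not penalized for following a valid solution path that differs from the others. A real evaluator would likely replace the token-overlap matcher with a stronger step-equivalence check.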