Staff View: :: Library Catalog

Bibliographic Details
Main Authors:	Jin, Jing; Liu, Hao; Bai, Yan; Lou, Yihang; Wang, Zhenke; Yuan, Tianrun; Chen, Juntong; Zhu, Yongkang; Zeng, Fanhu; Zhu, Xuanyu; Feng, Tao; Xu, Yige
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2604.19697

_version_	1866917472985153536
author	Jin, Jing; Liu, Hao; Bai, Yan; Lou, Yihang; Wang, Zhenke; Yuan, Tianrun; Chen, Juntong; Zhu, Yongkang; Zeng, Fanhu; Zhu, Xuanyu; Feng, Tao; Xu, Yige
author_facet	Jin, Jing; Liu, Hao; Bai, Yan; Lou, Yihang; Wang, Zhenke; Yuan, Tianrun; Chen, Juntong; Zhu, Yongkang; Zeng, Fanhu; Zhu, Xuanyu; Feng, Tao; Xu, Yige
contents	Multimodal large language models (MLLMs) have shown promising reasoning abilities, yet evaluating their performance in specialized domains remains challenging. STEM reasoning is a particularly valuable testbed because it provides highly verifiable feedback, but existing benchmarks often permit unimodal shortcuts due to modality redundancy and focus mainly on final-answer accuracy, overlooking the reasoning process itself. To address this challenge, we introduce StepSTEM: a graduate-level benchmark of 283 problems across mathematics, physics, chemistry, biology, and engineering for fine-grained evaluation of cross-modal reasoning in MLLMs. StepSTEM is constructed through a rigorous curation pipeline that enforces strict complementarity between textual and visual inputs. We further propose a general step-level evaluation framework for both text-only chain-of-thought and interleaved image-text reasoning, using dynamic programming to align predicted reasoning steps with multiple reference solutions. Experiments across a wide range of models show that current MLLMs still rely heavily on textual reasoning, with even Gemini 3.1 Pro and Claude Opus 4.6 achieving only 38.29% accuracy. These results highlight substantial headroom for genuine cross-modal STEM reasoning and position StepSTEM as a benchmark for fine-grained evaluation of multimodal reasoning. Source code is available at https://github.com/lll-hhh/STEPSTEM.
format	Preprint
id	arxiv_https___arxiv_org_abs_2604_19697
institution	arXiv
publishDate	2026
record_format	arxiv
title	Unveiling Fine-Grained Visual Traces: Evaluating Multimodal Interleaved Reasoning Chains in Multimodal STEM Tasks
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2604.19697

Similar Items