| Main Authors: | Jian, Ai; Qiu, Weijie; Wang, Xiaokun; Wang, Peiyu; Hao, Yunzhuo; Pei, Jiangbo; Wei, Yichen; Peng, Yi; Song, Xuchen |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Online Access: | https://arxiv.org/abs/2505.24120 |
Similar Items
Skywork-VL Reward: An Effective Reward Model for Multimodal Understanding and Reasoning
by: Wang, Xiaokun, et al.
Published: (2025)
Skywork R1V: Pioneering Multimodal Reasoning with Chain-of-Thought
by: Peng, Yi, et al.
Published: (2025)
Skywork R1V2: Multimodal Hybrid Reinforcement Learning for Reasoning
by: Wang, Peiyu, et al.
Published: (2025)
Skywork-R1V3 Technical Report
by: Shen, Wei, et al.
Published: (2025)
Skywork-R1V4: Toward Agentic Multimodal Intelligence through Interleaved Thinking with Images and DeepResearch
by: Zhang, Yifan, et al.
Published: (2025)
Can MLLMs Reason in Multimodality? EMMA: An Enhanced MultiModal ReAsoning Benchmark
by: Hao, Yunzhuo, et al.
Published: (2025)
STEMVerse: A Dual-Axis Diagnostic Framework for STEM Reasoning in Large Language Models
by: Li, Xuzhao, et al.
Published: (2026)
GAOKAO-MM: A Chinese Human-Level Benchmark for Multimodal Models Evaluation
by: Zong, Yi, et al.
Published: (2024)
Skywork UniPic: Unified Autoregressive Modeling for Visual Understanding and Generation
by: Wang, Peiyu, et al.
Published: (2025)
DeepPHY: Benchmarking Agentic VLMs on Physical Reasoning
by: Xu, Xinrun, et al.
Published: (2025)
VisRes Bench: On Evaluating the Visual Reasoning Capabilities of VLMs
by: Törtei, Brigitta Malagurski, et al.
Published: (2025)
Unveiling Fine-Grained Visual Traces: Evaluating Multimodal Interleaved Reasoning Chains in Multimodal STEM Tasks
by: Jin, Jing, et al.
Published: (2026)
ThinkMorph: Emergent Properties in Multimodal Interleaved Chain-of-Thought Reasoning
by: Gu, Jiawei, et al.
Published: (2025)
ElectriQ: A Benchmark for Assessing the Response Capability of Large Language Models in Power Marketing
by: Wang, Jinzhi, et al.
Published: (2025)
SUPERChem: A Multimodal Reasoning Benchmark in Chemistry
by: Zhao, Zehua, et al.
Published: (2025)
The Gordian Knot for VLMs: Diagrammatic Knot Reasoning as a Hard Benchmark
by: Liu, Hao, et al.
Published: (2026)
VerifyBench: A Systematic Benchmark for Evaluating Reasoning Verifiers Across Domains
by: Li, Xuzhao, et al.
Published: (2025)
SID: Benchmarking Guided Instruction Capabilities in STEM Education with a Socratic Interdisciplinary Dialogues Dataset
by: Jiang, Mei, et al.
Published: (2025)
Chart-based Reasoning: Transferring Capabilities from LLMs to VLMs
by: Carbune, Victor, et al.
Published: (2024)
Visual Language Tracking with Multi-modal Interaction: A Robust Benchmark
by: Li, Xuchen, et al.
Published: (2024)
Design Principles for the Construction of a Benchmark Evaluating Security Operation Capabilities of Multi-agent AI Systems
by: Cai, Yicheng, et al.
Published: (2026)
MMAFFBen: A Multilingual and Multimodal Affective Analysis Benchmark for Evaluating LLMs and VLMs
by: Liu, Zhiwei, et al.
Published: (2025)
TCC-Bench: Benchmarking the Traditional Chinese Culture Understanding Capabilities of MLLMs
by: Xu, Pengju, et al.
Published: (2025)
SubstationAI: Multimodal Large Model-Based Approaches for Analyzing Substation Equipment Faults
by: Wang, Jinzhi, et al.
Published: (2024)
STEM: Efficient Relative Capability Evaluation of LLMs through Structured Transition Samples
by: Hu, Haiquan, et al.
Published: (2025)
MPBench: A Comprehensive Multimodal Reasoning Benchmark for Process Errors Identification
by: Xu, Zhaopan, et al.
Published: (2025)
DTVLT: A Multi-modal Diverse Text Benchmark for Visual Language Tracking Based on LLM
by: Li, Xuchen, et al.
Published: (2024)
Multimodal Needle in a Haystack: Benchmarking Long-Context Capability of Multimodal Large Language Models
by: Wang, Hengyi, et al.
Published: (2024)
TReB: A Comprehensive Benchmark for Evaluating Table Reasoning Capabilities of Large Language Models
by: Li, Ce, et al.
Published: (2025)
CHBench: A Cognitive Hierarchy Benchmark for Evaluating Strategic Reasoning Capability of LLMs
by: Liu, Hongtao, et al.
Published: (2025)
Retrieval-Infused Reasoning Sandbox: A Benchmark for Decoupling Retrieval and Reasoning Capabilities
by: Ying, Shuangshuang, et al.
Published: (2026)
AMSbench: A Comprehensive Benchmark for Evaluating MLLM Capabilities in AMS Circuits
by: Shi, Yichen, et al.
Published: (2025)
MDK12-Bench: A Multi-Discipline Benchmark for Evaluating Reasoning in Multimodal Large Language Models
by: Zhou, Pengfei, et al.
Published: (2025)
Benchmarking VLMs' Reasoning About Persuasive Atypical Images
by: Malakouti, Sina, et al.
Published: (2024)
VRIQ: Benchmarking and Analyzing Visual-Reasoning IQ of VLMs
by: Khezresmaeilzadeh, Tina, et al.
Published: (2026)
Evaluating MLLMs with Multimodal Multi-image Reasoning Benchmark
by: Cheng, Ziming, et al.
Published: (2025)
ECG-Reasoning-Benchmark: A Benchmark for Evaluating Clinical Reasoning Capabilities in ECG Interpretation
by: Oh, Jungwoo, et al.
Published: (2026)
VLMs have Tunnel Vision: Evaluating Nonlocal Visual Reasoning in Leading VLMs
by: Berman, Shmuel, et al.
Published: (2025)
Skywork UniPic 2.0: Building Kontext Model with Online RL for Unified Multimodal Model
by: Wei, Hongyang, et al.
Published: (2025)
MMLU-SR: A Benchmark for Stress-Testing Reasoning Capability of Large Language Models
by: Wang, Wentian, et al.
Published: (2024)