| Main Authors: | Wang, Xinhe, Huang, Jin, Zhang, Xingjian, Wang, Tianhao, Ma, Jiaqi W. |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2512.21329 |
Similar Items
Symbolic Grounding Reveals Representational Bottlenecks in Abstract Visual Reasoning
by: Vaishnav, Mohit, et al.
Published: (2026)
Benchmarking Abstract and Reasoning Abilities Through A Theoretical Perspective
by: Ma, Qingchuan, et al.
Published: (2025)
IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs
by: Ma, David, et al.
Published: (2025)
Benchmarking Chinese Commonsense Reasoning with a Multi-hop Reasoning Perspective
by: You, Wangjie, et al.
Published: (2025)
PerceptionComp: A Video Benchmark for Complex Perception-Centric Reasoning
by: Li, Shaoxuan, et al.
Published: (2026)
UGPhysics: A Comprehensive Benchmark for Undergraduate Physics Reasoning with Large Language Models
by: Xu, Xin, et al.
Published: (2025)
Benchmarking Multi-Image Understanding in Vision and Language Models: Perception, Knowledge, Reasoning, and Multi-Hop Reasoning
by: Zhao, Bingchen, et al.
Published: (2024)
Is Your Model Really A Good Math Reasoner? Evaluating Mathematical Reasoning with Checklist
by: Zhou, Zihao, et al.
Published: (2024)
ATLAS: A High-Difficulty, Multidisciplinary Benchmark for Frontier Scientific Reasoning
by: Liu, Hongwei, et al.
Published: (2025)
Test of Time: A Benchmark for Evaluating LLMs on Temporal Reasoning
by: Fatemi, Bahare, et al.
Published: (2024)
FinanceReasoning: Benchmarking Financial Numerical Reasoning More Credible, Comprehensive and Challenging
by: Tang, Zichen, et al.
Published: (2025)
Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations
by: Sun, Jiaxing, et al.
Published: (2024)
MMATH: A Multilingual Benchmark for Mathematical Reasoning
by: Luo, Wenyang, et al.
Published: (2025)
ReasonBENCH: Benchmarking the (In)Stability of LLM Reasoning
by: Potamitis, Nearchos, et al.
Published: (2025)
TRAM: Benchmarking Temporal Reasoning for Large Language Models
by: Wang, Yuqing, et al.
Published: (2023)
LongReasonArena: A Long Reasoning Benchmark for Large Language Models
by: Ding, Jiayu, et al.
Published: (2025)
StressEval: Failure-Driven Dynamic Benchmarking for Knowledge-Intensive Reasoning in Large Language Models
by: Chen, Yongrui, et al.
Published: (2026)
MTR-Bench: A Comprehensive Benchmark for Multi-Turn Reasoning Evaluation
by: Li, Xiaoyuan, et al.
Published: (2025)
Classroom Final Exam: An Instructor-Tested Reasoning Benchmark
by: Gao, Chongyang, et al.
Published: (2026)
StructTest: Benchmarking LLMs' Reasoning through Compositional Structured Outputs
by: Chen, Hailin, et al.
Published: (2024)
Benchmarking LLMs' Mathematical Reasoning with Unseen Random Variables Questions
by: Hong, Zijin, et al.
Published: (2025)
THOUGHTTERMINATOR: Benchmarking, Calibrating, and Mitigating Overthinking in Reasoning Models
by: Pu, Xiao, et al.
Published: (2025)
MORABLES: A Benchmark for Assessing Abstract Moral Reasoning in LLMs with Fables
by: Marcuzzo, Matteo, et al.
Published: (2025)
SCoRE: Benchmarking Long-Chain Reasoning in Commonsense Scenarios
by: Zhan, Weidong, et al.
Published: (2025)
UGMathBench: A Diverse and Dynamic Benchmark for Undergraduate-Level Mathematical Reasoning with Large Language Models
by: Xu, Xin, et al.
Published: (2025)
Multimodal Inconsistency Reasoning (MMIR): A New Benchmark for Multimodal Reasoning Models
by: Yan, Qianqi, et al.
Published: (2025)
Benchmarking Temporal Reasoning and Alignment Across Chinese Dynasties
by: Wang, Zhenglin, et al.
Published: (2025)
AnesSuite: A Comprehensive Benchmark and Dataset Suite for Anesthesiology Reasoning in LLMs
by: Feng, Xiang, et al.
Published: (2025)
EffiReason-Bench: A Unified Benchmark for Evaluating and Advancing Efficient Reasoning in Large Language Models
by: Huang, Junquan, et al.
Published: (2025)
Functional Benchmarks for Robust Evaluation of Reasoning Performance, and the Reasoning Gap
by: Srivastava, Saurabh, et al.
Published: (2024)
EMO-Reasoning: Benchmarking Emotional Reasoning Capabilities in Spoken Dialogue Systems
by: Liu, Jingwen, et al.
Published: (2025)
RFEval: Benchmarking Reasoning Faithfulness under Counterfactual Reasoning Intervention in Large Reasoning Models
by: Han, Yunseok, et al.
Published: (2026)
SUPERChem: A Multimodal Reasoning Benchmark in Chemistry
by: Zhao, Zehua, et al.
Published: (2025)
MIND Your Reasoning: A Meta-Cognitive Intuitive-Reflective Network for Dual-Reasoning in Multimodal Stance Detection
by: Wang, Bingbing, et al.
Published: (2025)
ECG-Reasoning-Benchmark: A Benchmark for Evaluating Clinical Reasoning Capabilities in ECG Interpretation
by: Oh, Jungwoo, et al.
Published: (2026)
A Reasoning-Focused Legal Retrieval Benchmark
by: Zheng, Lucia, et al.
Published: (2025)
Unmasking Reasoning Processes: A Process-aware Benchmark for Evaluating Structural Mathematical Reasoning in LLMs
by: Zheng, Xiang, et al.
Published: (2026)
MORSE-500: A Programmatically Controllable Video Benchmark to Stress-Test Multimodal Reasoning
by: Cai, Zikui, et al.
Published: (2025)
Are Your LLMs Capable of Stable Reasoning?
by: Liu, Junnan, et al.
Published: (2024)
DiagnosisArena: Benchmarking Diagnostic Reasoning for Large Language Models
by: Zhu, Yakun, et al.
Published: (2025)