Library Catalog

Bibliographic Details
Main Authors:	Wang, Xinhe; Huang, Jin; Zhang, Xingjian; Wang, Tianhao; Ma, Jiaqi W.
Format:	Preprint
Published:	2025
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2512.21329

Similar Items

Symbolic Grounding Reveals Representational Bottlenecks in Abstract Visual Reasoning
by: Vaishnav, Mohit, et al.
Published: (2026)

Benchmarking Abstract and Reasoning Abilities Through A Theoretical Perspective
by: Ma, Qingchuan, et al.
Published: (2025)

IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs
by: Ma, David, et al.
Published: (2025)

Benchmarking Chinese Commonsense Reasoning with a Multi-hop Reasoning Perspective
by: You, Wangjie, et al.
Published: (2025)

PerceptionComp: A Video Benchmark for Complex Perception-Centric Reasoning
by: Li, Shaoxuan, et al.
Published: (2026)

UGPhysics: A Comprehensive Benchmark for Undergraduate Physics Reasoning with Large Language Models
by: Xu, Xin, et al.
Published: (2025)

Benchmarking Multi-Image Understanding in Vision and Language Models: Perception, Knowledge, Reasoning, and Multi-Hop Reasoning
by: Zhao, Bingchen, et al.
Published: (2024)

Is Your Model Really A Good Math Reasoner? Evaluating Mathematical Reasoning with Checklist
by: Zhou, Zihao, et al.
Published: (2024)

ATLAS: A High-Difficulty, Multidisciplinary Benchmark for Frontier Scientific Reasoning
by: Liu, Hongwei, et al.
Published: (2025)

Test of Time: A Benchmark for Evaluating LLMs on Temporal Reasoning
by: Fatemi, Bahare, et al.
Published: (2024)

FinanceReasoning: Benchmarking Financial Numerical Reasoning More Credible, Comprehensive and Challenging
by: Tang, Zichen, et al.
Published: (2025)

Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations
by: Sun, Jiaxing, et al.
Published: (2024)

MMATH: A Multilingual Benchmark for Mathematical Reasoning
by: Luo, Wenyang, et al.
Published: (2025)

ReasonBENCH: Benchmarking the (In)Stability of LLM Reasoning
by: Potamitis, Nearchos, et al.
Published: (2025)

TRAM: Benchmarking Temporal Reasoning for Large Language Models
by: Wang, Yuqing, et al.
Published: (2023)

LongReasonArena: A Long Reasoning Benchmark for Large Language Models
by: Ding, Jiayu, et al.
Published: (2025)

StressEval: Failure-Driven Dynamic Benchmarking for Knowledge-Intensive Reasoning in Large Language Models
by: Chen, Yongrui, et al.
Published: (2026)

MTR-Bench: A Comprehensive Benchmark for Multi-Turn Reasoning Evaluation
by: Li, Xiaoyuan, et al.
Published: (2025)

Classroom Final Exam: An Instructor-Tested Reasoning Benchmark
by: Gao, Chongyang, et al.
Published: (2026)

StructTest: Benchmarking LLMs' Reasoning through Compositional Structured Outputs
by: Chen, Hailin, et al.
Published: (2024)

Benchmarking LLMs' Mathematical Reasoning with Unseen Random Variables Questions
by: Hong, Zijin, et al.
Published: (2025)

THOUGHTTERMINATOR: Benchmarking, Calibrating, and Mitigating Overthinking in Reasoning Models
by: Pu, Xiao, et al.
Published: (2025)

MORABLES: A Benchmark for Assessing Abstract Moral Reasoning in LLMs with Fables
by: Marcuzzo, Matteo, et al.
Published: (2025)

SCoRE: Benchmarking Long-Chain Reasoning in Commonsense Scenarios
by: Zhan, Weidong, et al.
Published: (2025)

UGMathBench: A Diverse and Dynamic Benchmark for Undergraduate-Level Mathematical Reasoning with Large Language Models
by: Xu, Xin, et al.
Published: (2025)

Multimodal Inconsistency Reasoning (MMIR): A New Benchmark for Multimodal Reasoning Models
by: Yan, Qianqi, et al.
Published: (2025)

Benchmarking Temporal Reasoning and Alignment Across Chinese Dynasties
by: Wang, Zhenglin, et al.
Published: (2025)

AnesSuite: A Comprehensive Benchmark and Dataset Suite for Anesthesiology Reasoning in LLMs
by: Feng, Xiang, et al.
Published: (2025)

EffiReason-Bench: A Unified Benchmark for Evaluating and Advancing Efficient Reasoning in Large Language Models
by: Huang, Junquan, et al.
Published: (2025)

Functional Benchmarks for Robust Evaluation of Reasoning Performance, and the Reasoning Gap
by: Srivastava, Saurabh, et al.
Published: (2024)

EMO-Reasoning: Benchmarking Emotional Reasoning Capabilities in Spoken Dialogue Systems
by: Liu, Jingwen, et al.
Published: (2025)

RFEval: Benchmarking Reasoning Faithfulness under Counterfactual Reasoning Intervention in Large Reasoning Models
by: Han, Yunseok, et al.
Published: (2026)

SUPERChem: A Multimodal Reasoning Benchmark in Chemistry
by: Zhao, Zehua, et al.
Published: (2025)

MIND Your Reasoning: A Meta-Cognitive Intuitive-Reflective Network for Dual-Reasoning in Multimodal Stance Detection
by: Wang, Bingbing, et al.
Published: (2025)

ECG-Reasoning-Benchmark: A Benchmark for Evaluating Clinical Reasoning Capabilities in ECG Interpretation
by: Oh, Jungwoo, et al.
Published: (2026)

A Reasoning-Focused Legal Retrieval Benchmark
by: Zheng, Lucia, et al.
Published: (2025)

Unmasking Reasoning Processes: A Process-aware Benchmark for Evaluating Structural Mathematical Reasoning in LLMs
by: Zheng, Xiang, et al.
Published: (2026)

MORSE-500: A Programmatically Controllable Video Benchmark to Stress-Test Multimodal Reasoning
by: Cai, Zikui, et al.
Published: (2025)

Are Your LLMs Capable of Stable Reasoning?
by: Liu, Junnan, et al.
Published: (2024)

DiagnosisArena: Benchmarking Diagnostic Reasoning for Large Language Models
by: Zhu, Yakun, et al.
Published: (2025)