Bibliographic Details
Main Authors: Zhan, Weidong, Wang, Yue, Hu, Nan, Xiao, Liming, Ma, Jingyuan, Qin, Yuhang, Li, Zheng, Yang, Yixin, Deng, Sirui, Ding, Jinkun, Ma, Wenhan, Li, Rui, Luo, Weilin, Liu, Qun, Sui, Zhifang
Format: Preprint
Published: 2025
Subjects: Computation and Language
Online Access: https://arxiv.org/abs/2503.06218
author Zhan, Weidong
Wang, Yue
Hu, Nan
Xiao, Liming
Ma, Jingyuan
Qin, Yuhang
Li, Zheng
Yang, Yixin
Deng, Sirui
Ding, Jinkun
Ma, Wenhan
Li, Rui
Luo, Weilin
Liu, Qun
Sui, Zhifang
contents Long-chain reasoning remains a key challenge for large language models (LLMs) because natural texts lack sufficient explicit reasoning data. Existing benchmarks, moreover, suffer from limitations such as narrow coverage, short reasoning paths, or high construction costs. We introduce SCoRE (Scenario-based Commonsense Reasoning Evaluation), a benchmark that synthesizes multi-hop questions from scenario schemas of entities, relations, and logical rules to assess long-chain commonsense reasoning. SCoRE contains 100k bilingual (Chinese-English) multiple-choice questions whose reasoning chains span 2-11 hops and are grouped into various difficulty levels. Each question is accompanied by fine-grained knowledge labels, explicit reasoning chains, and difficulty levels for diagnostic evaluation. Evaluation results on cutting-edge LLMs such as o3-mini and DeepSeek R1 show that even the best model attains only 69.78% accuracy on SCoRE (and only 47.91% on the hard set), with errors often stemming from rare knowledge, logical inconsistency, and over-interpretation of simple questions. SCoRE offers a scalable, extensible framework for evaluating and diagnosing the long-chain commonsense reasoning abilities of LLMs and for guiding future advances in model design and training.
format Preprint
id arxiv_https___arxiv_org_abs_2503_06218
institution arXiv
publishDate 2025
record_format arxiv
title SCoRE: Benchmarking Long-Chain Reasoning in Commonsense Scenarios
topic Computation and Language
url https://arxiv.org/abs/2503.06218