Staff View: SCoRE: Benchmarking Long-Chain Reasoning in Commonsense Scenarios :: Library Catalog

Bibliographic Details
Main Authors:	Zhan, Weidong; Wang, Yue; Hu, Nan; Xiao, Liming; Ma, Jingyuan; Qin, Yuhang; Li, Zheng; Yang, Yixin; Deng, Sirui; Ding, Jinkun; Ma, Wenhan; Li, Rui; Luo, Weilin; Liu, Qun; Sui, Zhifang
Format:	Preprint
Published:	2025
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2503.06218

_version_	1866913843538558976
author	Zhan, Weidong; Wang, Yue; Hu, Nan; Xiao, Liming; Ma, Jingyuan; Qin, Yuhang; Li, Zheng; Yang, Yixin; Deng, Sirui; Ding, Jinkun; Ma, Wenhan; Li, Rui; Luo, Weilin; Liu, Qun; Sui, Zhifang
author_facet	Zhan, Weidong; Wang, Yue; Hu, Nan; Xiao, Liming; Ma, Jingyuan; Qin, Yuhang; Li, Zheng; Yang, Yixin; Deng, Sirui; Ding, Jinkun; Ma, Wenhan; Li, Rui; Luo, Weilin; Liu, Qun; Sui, Zhifang
contents	Currently, long-chain reasoning remains a key challenge for large language models (LLMs) because natural texts lack sufficient explicit reasoning data. However, existing benchmarks suffer from limitations such as narrow coverage, short reasoning paths, or high construction costs. We introduce SCoRE (Scenario-based Commonsense Reasoning Evaluation), a benchmark that synthesizes multi-hop questions from scenario schemas of entities, relations, and logical rules to assess long-chain commonsense reasoning. SCoRE contains 100k bilingual (Chinese-English) multiple-choice questions whose reasoning chains span 2-11 hops and are grouped into various difficulty levels. Each question is accompanied by fine-grained knowledge labels, explicit reasoning chains, and difficulty levels for diagnostic evaluation. Evaluation results on cutting-edge LLMs such as o3-mini and Deepseek R1 show that even the best model attains only 69.78% accuracy on SCoRE (and only 47.91% on the hard set), with errors often stemming from rare knowledge, logical inconsistency, and over-interpretation of simple questions. SCoRE offers a scalable, extensible framework for evaluating and diagnosing the long-chain commonsense reasoning abilities of LLMs and guiding future advances in model design and training.
format	Preprint
id	arxiv_https___arxiv_org_abs_2503_06218
institution	arXiv
publishDate	2025
record_format	arxiv
title	SCoRE: Benchmarking Long-Chain Reasoning in Commonsense Scenarios
topic	Computation and Language
url	https://arxiv.org/abs/2503.06218

Similar Items