Staff View: P-FOLIO: Evaluating and Improving Logical Reasoning with Abundant Human-Written Reasoning Chains :: Library Catalog

Bibliographic Details
Main Authors:	Han, Simeng; Yu, Aaron; Shen, Rui; Qi, Zhenting; Riddell, Martin; Zhou, Wenfei; Qiao, Yujie; Zhao, Yilun; Yavuz, Semih; Liu, Ye; Joty, Shafiq; Zhou, Yingbo; Xiong, Caiming; Radev, Dragomir; Ying, Rex; Cohan, Arman
Format:	Preprint
Published:	2024
Subjects:	Artificial Intelligence; Computation and Language
Online Access:	https://arxiv.org/abs/2410.09207

_version_	1866917801237676032
author	Han, Simeng; Yu, Aaron; Shen, Rui; Qi, Zhenting; Riddell, Martin; Zhou, Wenfei; Qiao, Yujie; Zhao, Yilun; Yavuz, Semih; Liu, Ye; Joty, Shafiq; Zhou, Yingbo; Xiong, Caiming; Radev, Dragomir; Ying, Rex; Cohan, Arman
author_facet	Han, Simeng; Yu, Aaron; Shen, Rui; Qi, Zhenting; Riddell, Martin; Zhou, Wenfei; Qiao, Yujie; Zhao, Yilun; Yavuz, Semih; Liu, Ye; Joty, Shafiq; Zhou, Yingbo; Xiong, Caiming; Radev, Dragomir; Ying, Rex; Cohan, Arman
contents	Existing methods for understanding the capabilities of LLMs in logical reasoning rely on binary entailment classification or synthetically derived rationales, which are not sufficient for proper investigation of models' capabilities. We present P-FOLIO, a human-annotated dataset consisting of diverse and complex reasoning chains for a set of realistic logical reasoning stories also written by humans. P-FOLIO is collected with an annotation protocol that helps humans annotate well-structured natural language proofs for first-order logic reasoning problems in a step-by-step manner. The number of reasoning steps in P-FOLIO spans from 0 to 20. We further use P-FOLIO to evaluate and improve large-language-model (LLM) reasoning capabilities. We evaluate LLM reasoning capabilities at a fine granularity via single-step inference rule classification, with more diverse inference rules and higher levels of complexity than previous work. Given that a single model-generated reasoning chain could take a completely different path than the human-annotated one, we sample multiple reasoning chains from a model and use pass@k metrics for evaluating the quality of model-generated reasoning chains. We show that human-written reasoning chains significantly boost the logical reasoning capabilities of LLMs via many-shot prompting and fine-tuning. Furthermore, fine-tuning Llama3-7B on P-FOLIO improves the model performance by 10% or more on three other out-of-domain logical reasoning datasets. We also conduct detailed analysis to show where the most powerful LLMs fall short in reasoning. We will release the dataset and code publicly.
format	Preprint
id	arxiv_https___arxiv_org_abs_2410_09207
institution	arXiv
publishDate	2024
record_format	arxiv
title	P-FOLIO: Evaluating and Improving Logical Reasoning with Abundant Human-Written Reasoning Chains
topic	Artificial Intelligence; Computation and Language
url	https://arxiv.org/abs/2410.09207

Similar Items