Bibliographic Details
Main Authors: Cheng, Qi, Boratko, Michael, Yelugam, Pranay Kumar, O'Gorman, Tim, Singh, Nalini, McCallum, Andrew, Li, Xiang Lorraine
Format: Preprint
Published: 2024
Subjects:
Online Access: https://arxiv.org/abs/2406.04145
_version_ 1866929376241647616
author Cheng, Qi
Boratko, Michael
Yelugam, Pranay Kumar
O'Gorman, Tim
Singh, Nalini
McCallum, Andrew
Li, Xiang Lorraine
author_facet Cheng, Qi
Boratko, Michael
Yelugam, Pranay Kumar
O'Gorman, Tim
Singh, Nalini
McCallum, Andrew
Li, Xiang Lorraine
contents Large language models have demonstrated impressive performance on commonsense tasks; however, these tasks are often posed as multiple-choice questions, allowing models to exploit systematic biases. Common sense is also inherently probabilistic, with multiple correct answers. The purpose of "boiling water" could be making tea or cooking, but it could also be killing germs. Existing tasks do not capture the probabilistic nature of common sense. To this end, we present commonsense frame completion (CFC), a new generative task that evaluates common sense via multiple open-ended generations. We also propose a method of probabilistic evaluation that strongly correlates with human judgments. Humans drastically outperform strong language model baselines on our dataset, indicating this approach is both a challenging and useful evaluation of machine common sense.
format Preprint
id arxiv_https___arxiv_org_abs_2406_04145
institution arXiv
publishDate 2024
record_format arxiv
spellingShingle Every Answer Matters: Evaluating Commonsense with Probabilistic Measures
Cheng, Qi
Boratko, Michael
Yelugam, Pranay Kumar
O'Gorman, Tim
Singh, Nalini
McCallum, Andrew
Li, Xiang Lorraine
Computation and Language
Artificial Intelligence
title Every Answer Matters: Evaluating Commonsense with Probabilistic Measures
topic Computation and Language
Artificial Intelligence
url https://arxiv.org/abs/2406.04145