Staff View: Every Answer Matters: Evaluating Commonsense with Probabilistic Measures :: Library Catalog

Bibliographic Details
Main Authors:	Cheng, Qi; Boratko, Michael; Yelugam, Pranay Kumar; O'Gorman, Tim; Singh, Nalini; McCallum, Andrew; Li, Xiang Lorraine
Format:	Preprint
Published:	2024
Subjects:	Computation and Language; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2406.04145

_version_	1866929376241647616
author	Cheng, Qi; Boratko, Michael; Yelugam, Pranay Kumar; O'Gorman, Tim; Singh, Nalini; McCallum, Andrew; Li, Xiang Lorraine
author_facet	Cheng, Qi; Boratko, Michael; Yelugam, Pranay Kumar; O'Gorman, Tim; Singh, Nalini; McCallum, Andrew; Li, Xiang Lorraine
contents	Large language models have demonstrated impressive performance on commonsense tasks; however, these tasks are often posed as multiple-choice questions, allowing models to exploit systematic biases. Commonsense is also inherently probabilistic with multiple correct answers. The purpose of "boiling water" could be making tea and cooking, but it also could be killing germs. Existing tasks do not capture the probabilistic nature of common sense. To this end, we present commonsense frame completion (CFC), a new generative task that evaluates common sense via multiple open-ended generations. We also propose a method of probabilistic evaluation that strongly correlates with human judgments. Humans drastically outperform strong language model baselines on our dataset, indicating this approach is both a challenging and useful evaluation of machine common sense.
format	Preprint
id	arxiv_https___arxiv_org_abs_2406_04145
institution	arXiv
publishDate	2024
record_format	arxiv
title	Every Answer Matters: Evaluating Commonsense with Probabilistic Measures
topic	Computation and Language; Artificial Intelligence
url	https://arxiv.org/abs/2406.04145
