Bibliographic Details
Main Authors: Liu, Weitang; Li, Ying Wai; Li, Yuelei; Wang, Zihan; You, Yi-Zhuang; Shang, Jingbo
Format: Preprint
Published: 2023
Subjects: Machine Learning, Artificial Intelligence
Online Access: https://arxiv.org/abs/2312.03291
author Liu, Weitang
Li, Ying Wai
Li, Yuelei
Wang, Zihan
You, Yi-Zhuang
Shang, Jingbo
contents Evaluating models on datasets often fails to capture their behavior when faced with unexpected and diverse types of inputs. It would be beneficial to evaluate the difference between human annotation and model prediction for an internet-scale number of inputs or, more generally, for an input space whose enumeration is computationally impractical. Traditional model evaluation methods rely on precision and recall (PR) as metrics, which are typically estimated by comparing human annotations with model predictions on a specific dataset. This is feasible because enumerating thousands of test inputs is manageable. However, estimating PR across a large input space is challenging because enumeration becomes computationally infeasible. We propose OmniInput, a novel approach to evaluate and compare neural networks (NNs) by the PR of an input space. OmniInput differs from previous work in that its estimated PR reflects the difference between human annotation and model prediction across an input space that is usually too large to enumerate. We empirically validate our method within an enumerable input space, and our experiments demonstrate that OmniInput can effectively estimate and compare precision and recall for (large) language models within a broad input space that is not enumerable.
format Preprint
id arxiv_https___arxiv_org_abs_2312_03291
institution arXiv
publishDate 2023
record_format arxiv
title Evaluation of human-model prediction difference on the Internet Scale of Data
topic Machine Learning
Artificial Intelligence
url https://arxiv.org/abs/2312.03291