Bibliographic Details
Main Authors: Lam, Kevin Fu Yuan, Gopal, Vikneswaran, Qian, Jiang
Format: Preprint
Published: 2023
Subjects: Methodology
Online Access: https://arxiv.org/abs/2309.14621
author Lam, Kevin Fu Yuan
Gopal, Vikneswaran
Qian, Jiang
contents In Natural Language Processing (NLP), binary classification algorithms are often evaluated using the F1 score. Because the sample F1 score is only an estimate of the population F1 score, reporting it without an indication of its accuracy is not sufficient. Confidence intervals provide such an indication. However, most studies either do not report them or construct them with methods that have poor statistical properties. In the present study, I review current analytical methods (i.e., the Clopper-Pearson method and the Wald method) for constructing confidence intervals for the population F1 score, propose two new analytical methods (i.e., the Wilson direct method and the Wilson indirect method), and compare all four methods on their coverage probabilities and interval lengths, as well as on whether they suffer from overshoot and degeneracy. Theoretical results demonstrate that neither proposed method suffers from overshoot or degeneracy. Experimental results suggest that both proposed methods outperform the current methods in terms of coverage probabilities and interval lengths. I illustrate both current and proposed methods on two suggestion mining tasks, discuss the practical implications of these results, and suggest areas for future research.
format Preprint
id arxiv_https___arxiv_org_abs_2309_14621
institution arXiv
publishDate 2023
record_format arxiv
title Confidence Intervals for the F1 Score: A Comparison of Four Methods
topic Methodology
url https://arxiv.org/abs/2309.14621
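example A minimal sketch of one way an "indirect" Wilson-style interval for the F1 score could be computed. This is an illustration under stated assumptions, not necessarily the construction used in the preprint. It relies on the identity F* = F1 / (2 - F1) = TP / (TP + FP + FN), assumes F* can be treated as a binomial proportion with n = TP + FP + FN trials, applies the standard Wilson score interval to F*, and maps the endpoints back with F1 = 2F* / (1 + F*). The function names and the default z = 1.96 (roughly a 95% interval) are hypothetical choices made for this sketch.

```python
import math


def f1_score(tp, fp, fn):
    """Sample F1 score from confusion-matrix counts."""
    return 2 * tp / (2 * tp + fp + fn)


def wilson_interval(successes, n, z=1.96):
    """Standard Wilson score interval for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    # Clamp only to absorb floating-point rounding at the boundaries.
    return max(0.0, centre - half), min(1.0, centre + half)


def f1_interval_indirect(tp, fp, fn, z=1.96):
    """Illustrative interval for F1 via the transformed proportion F*.

    Assumption (not taken from the record): F* = TP / (TP + FP + FN)
    is treated as a binomial proportion with n = TP + FP + FN trials.
    """
    n = tp + fp + fn
    lo_star, hi_star = wilson_interval(tp, n, z)

    def to_f1(f_star):
        # Inverse of F* = F1 / (2 - F1); monotone increasing on [0, 1].
        return 2 * f_star / (1 + f_star)

    return to_f1(lo_star), to_f1(hi_star)


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    tp, fp, fn = 80, 10, 10
    lo, hi = f1_interval_indirect(tp, fp, fn)
    print(f"F1 = {f1_score(tp, fp, fn):.4f}, interval = ({lo:.4f}, {hi:.4f})")
```

Because the back-transformation is monotone and maps [0, 1] onto [0, 1], the endpoints in this sketch cannot leave the unit interval, and the Wilson interval keeps a positive width even when FP = FN = 0. True negatives do not enter the calculation because F1 does not depend on them.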