Staff View: Confidence Intervals for the F1 Score: A Comparison of Four Methods :: Library Catalog

Bibliographic Details
Main Authors:	Lam, Kevin Fu Yuan; Gopal, Vikneswaran; Qian, Jiang
Format:	Preprint
Published:	2023
Subjects:	Methodology
Online Access:	https://arxiv.org/abs/2309.14621

_version_	1866909219487219712
author	Lam, Kevin Fu Yuan; Gopal, Vikneswaran; Qian, Jiang
author_facet	Lam, Kevin Fu Yuan; Gopal, Vikneswaran; Qian, Jiang
contents	In Natural Language Processing (NLP), binary classification algorithms are often evaluated using the F1 score. Because the sample F1 score is an estimate of the population F1 score, it is not sufficient to report the sample F1 score without an indication of how accurate it is. Confidence intervals are an indication of how accurate the sample F1 score is. However, most studies either do not report them or report them using methods that demonstrate poor statistical properties. In the present study, I review current analytical methods (i.e., Clopper-Pearson method and Wald method) to construct confidence intervals for the population F1 score, propose two new analytical methods (i.e., Wilson direct method and Wilson indirect method) to do so, and compare these methods based on their coverage probabilities and interval lengths, as well as whether these methods suffer from overshoot and degeneracy. Theoretical results demonstrate that both proposed methods do not suffer from overshoot and degeneracy. Experimental results suggest that both proposed methods perform better, as compared to current methods, in terms of coverage probabilities and interval lengths. I illustrate both current and proposed methods on two suggestion mining tasks. I discuss the practical implications of these results, and suggest areas for future research.
format	Preprint
id	arxiv_https___arxiv_org_abs_2309_14621
institution	arXiv
publishDate	2023
record_format	arxiv
title	Confidence Intervals for the F1 Score: A Comparison of Four Methods
topic	Methodology
url	https://arxiv.org/abs/2309.14621
