Saved in:
| Main Authors: | , , , |
|---|---|
| Format: | Preprint |
| Published: |
2021
|
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2106.01254 |
| Tags: |
Add Tag
No Tags, Be the first to tag this record!
|
Table of Contents:
- In many classification tasks, there is no definitive ground truth, only human judgments that may disagree. We address two challenges that arise in such settings: (1) how to use human raters to score classifiers, and (2) how to use them for comparison benchmarks. For the first, the common practice is to score classifiers against the majority vote of an evaluation panel of several human raters. We argue that this is not justified when either of two properties fails: objectivity or equanimity. Instead, under a utility model appropriate for such settings, scoring against one rater at a time and averaging the scores across raters is a more principled approach. For the second, we introduce the concept of rater equivalence: the smallest number of human raters whose combined judgment matches the classifier's performance. We provide a provably optimal algorithm for combining benchmark panel labels, and demonstrate the framework through case studies.