Bibliographic Details
Main Authors: Bateux, Quentin; Koss, Jonathan; Sweeney, Patrick W.; Edwards, Erika; Rios, Nelson; Dollar, Aaron M.
Format: Preprint
Published: 2024
Subjects: Computer Vision and Pattern Recognition; Populations and Evolution
Online Access: https://arxiv.org/abs/2411.10074
author Bateux, Quentin
Koss, Jonathan
Sweeney, Patrick W.
Edwards, Erika
Rios, Nelson
Dollar, Aaron M.
contents The digitization of natural history collections over the past three decades has unlocked a treasure trove of specimen imagery and metadata. There is great interest in making this data more useful by labeling it with additional trait data, and modern deep learning techniques using convolutional neural networks (CNNs) and similar architectures show particular promise for reducing the manual labeling required of human experts, making the process much faster and less expensive. However, in most cases the accuracy of these approaches, typically in the range of 80-85%, is too low for the automatic labels to be used reliably. In this paper, we present and validate an approach that can greatly improve this accuracy by examining the confidence that the network has in each generated label and applying a user-defined threshold to reject labels whose confidence falls below a chosen level. We demonstrate that a naive model with 86% initial accuracy can reach over 95% accuracy (rejecting about 40% of the labels) or over 99% accuracy (rejecting about 65%) by selecting higher confidence thresholds. This gives flexibility to adapt existing models to the statistical requirements of various types of research and has the potential to move these automatic labeling approaches from being unusably inaccurate to being an invaluable new tool. After validating the approach in a number of ways, we annotate the reproductive state of a large dataset of over 600,000 herbarium specimens. Analysis of the results points to under-investigated correlations as well as general alignment with known trends. By sharing this new dataset alongside this work, we aim to let ecologists gather insights for their own research questions at their chosen point on the accuracy/coverage trade-off.
format Preprint
id arxiv_https___arxiv_org_abs_2411_10074
institution arXiv
publishDate 2024
record_format arxiv
title Improving the accuracy of automated labeling of specimen images datasets via a confidence-based process
topic Computer Vision and Pattern Recognition
Populations and Evolution
url https://arxiv.org/abs/2411.10074
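
The confidence-based rejection summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that "confidence" means the highest class probability output by the network, and the function names, toy probabilities, and thresholds below are hypothetical values chosen only to show how raising the threshold trades coverage for accuracy.

```python
from typing import List, Optional, Sequence, Tuple


def label_with_rejection(
    probabilities: Sequence[Sequence[float]],
    threshold: float,
) -> List[Optional[int]]:
    """Return the top class per sample, or None when its probability is below threshold."""
    labels: List[Optional[int]] = []
    for probs in probabilities:
        # Assumption: confidence is the maximum class probability.
        best = max(range(len(probs)), key=lambda i: probs[i])
        labels.append(best if probs[best] >= threshold else None)
    return labels


def accuracy_and_coverage(
    predicted: Sequence[Optional[int]],
    truth: Sequence[int],
) -> Tuple[float, float]:
    """Accuracy over the accepted labels, and the fraction of samples accepted."""
    kept = [(p, t) for p, t in zip(predicted, truth) if p is not None]
    coverage = len(kept) / len(truth) if truth else 0.0
    if not kept:
        return float("nan"), coverage
    correct = sum(1 for p, t in kept if p == t)
    return correct / len(kept), coverage


# Hypothetical class probabilities for a two-class problem (illustrative only).
probs = [[0.97, 0.03], [0.55, 0.45], [0.10, 0.90], [0.60, 0.40], [0.20, 0.80]]
truth = [0, 1, 1, 0, 1]

for threshold in (0.5, 0.7, 0.95):
    predicted = label_with_rejection(probs, threshold)
    acc, cov = accuracy_and_coverage(predicted, truth)
    print(f"threshold={threshold:.2f} accuracy={acc:.2f} coverage={cov:.2f}")
```

Sweeping the threshold over a held-out labeled set in this way yields an accuracy/coverage curve, from which a user could pick the operating point that matches the statistical requirements of their study.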