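Staff View: Improving the accuracy of automated labeling of specimen images datasets via a confidence-based process :: Library Catalog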

Bibliographic Details
Main Authors:	Bateux, Quentin; Koss, Jonathan; Sweeney, Patrick W.; Edwards, Erika; Rios, Nelson; Dollar, Aaron M.
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition; Populations and Evolution
Online Access:	https://arxiv.org/abs/2411.10074

_version_	1866929600145129472
author	Bateux, Quentin; Koss, Jonathan; Sweeney, Patrick W.; Edwards, Erika; Rios, Nelson; Dollar, Aaron M.
author_facet	Bateux, Quentin; Koss, Jonathan; Sweeney, Patrick W.; Edwards, Erika; Rios, Nelson; Dollar, Aaron M.
contents	The digitization of natural history collections over the past three decades has unlocked a treasure trove of specimen imagery and metadata. There is great interest in making this data more useful by further labeling it with additional trait data, and modern deep learning machine learning techniques utilizing convolutional neural nets (CNNs) and similar networks show particular promise to reduce the amount of required manual labeling by human experts, making the process much faster and less expensive. However, in most cases, the accuracy of these approaches is too low for reliable utilization of the automatic labeling, typically in the range of 80-85% accuracy. In this paper, we present and validate an approach that can greatly improve this accuracy, essentially by examining the confidence that the network has in the generated label as well as utilizing a user-defined threshold to reject labels that fall below a chosen level. We demonstrate that a naive model that produced 86% initial accuracy can achieve improved performance - over 95% accuracy (rejecting about 40% of the labels) or over 99% accuracy (rejecting about 65%) by selecting higher confidence thresholds. This gives flexibility to adapt existing models to the statistical requirements of various types of research and has the potential to move these automatic labeling approaches from being unusably inaccurate to being an invaluable new tool. After validating the approach in a number of ways, we annotate the reproductive state of a large dataset of over 600,000 herbarium specimens. The analysis of the results points at under-investigated correlations as well as general alignment with known trends. By sharing this new dataset alongside this work, we want to allow ecologists to gather insights for their own research questions, at their chosen point of accuracy/coverage trade-off.
format	Preprint
id	arxiv_https___arxiv_org_abs_2411_10074
institution	arXiv
publishDate	2024
record_format	arxiv
title	Improving the accuracy of automated labeling of specimen images datasets via a confidence-based process
topic	Computer Vision and Pattern Recognition; Populations and Evolution
url	https://arxiv.org/abs/2411.10074
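
The abstract in the contents field describes rejecting automatically generated labels whose confidence falls below a user-defined threshold, trading coverage for accuracy. The sketch below illustrates that general idea only; it is not the authors' code. The function name, the label strings, and the toy data are all hypothetical, and the numbers it produces are unrelated to the figures reported in the abstract.

```python
from typing import Sequence, Tuple


def accuracy_and_rejection(
    confidences: Sequence[float],
    predictions: Sequence[str],
    truths: Sequence[str],
    threshold: float,
) -> Tuple[float, float]:
    """Keep labels with confidence >= threshold.

    Returns (accuracy on the kept labels, fraction of labels rejected).
    Hypothetical helper for illustration, not taken from the paper.
    """
    total = len(confidences)
    if total == 0:
        raise ValueError("no labels to evaluate")
    kept = [
        (pred, truth)
        for conf, pred, truth in zip(confidences, predictions, truths)
        if conf >= threshold
    ]
    rejection_rate = (total - len(kept)) / total
    if not kept:
        # Every label was rejected, so accuracy is undefined; report 0.0.
        return 0.0, rejection_rate
    correct = sum(1 for pred, truth in kept if pred == truth)
    return correct / len(kept), rejection_rate


# Toy data (made up): ten specimens, with two errors among the
# low-confidence predictions.
confidences = [0.99, 0.97, 0.95, 0.90, 0.85, 0.80, 0.70, 0.65, 0.60, 0.55]
predictions = ["reproductive", "sterile", "reproductive", "reproductive",
               "sterile", "reproductive", "sterile", "reproductive",
               "sterile", "reproductive"]
truths = ["reproductive", "sterile", "reproductive", "reproductive",
          "sterile", "reproductive", "reproductive", "reproductive",
          "sterile", "sterile"]

for threshold in (0.0, 0.75):
    acc, rej = accuracy_and_rejection(confidences, predictions, truths, threshold)
    print(f"threshold={threshold:.2f} accuracy={acc:.2f} rejected={rej:.2f}")
```

On this toy data, raising the threshold discards the low-confidence predictions, which is where the errors sit, so accuracy on the retained labels rises while coverage falls. A user would pick the threshold that matches the accuracy their research question requires.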

Similar Items