author Orsholm, Johanna
Quinto, John
Autto, Hannu
Banelyte, Gaia
Chazot, Nicolas
deWaard, Jeremy
deWaard, Stephanie
Farrell, Arielle
Furneaux, Brendan
Hardwick, Bess
Ito, Nao
Kar, Amlan
Kalttopää, Oula
Kerdraon, Deirdre
Kristensen, Erik
McKeown, Jaclyn
Mononen, Tommi
Nein, Ellen
Rogers, Hanna
Roslin, Tomas
Schmitz, Paula
Sones, Jayme
Sujala, Maija
Thompson, Amy
Zakharov, Evgeny V.
Zarubiieva, Iuliia
Gupta, Akshita
Lowe, Scott C.
Taylor, Graham W.
contents Insects comprise millions of species, many of which are experiencing severe population declines under environmental and habitat changes. High-throughput approaches are crucial for accelerating our understanding of insect diversity, with DNA barcoding and high-resolution imaging showing strong potential for automatic taxonomic classification. However, most image-based approaches rely on individual specimen data, whereas large-scale ecological surveys collect unsorted bulk samples. We present the Mixed Arthropod Sample Segmentation and Identification (MassID45) dataset for training automatic classifiers of bulk insect samples. It uniquely combines molecular and imaging data at both the unsorted sample level and the level of individual specimens, covering the full set of specimens. Human annotators, supported by an AI-assisted tool, performed two tasks on bulk images: creating segmentation masks around each individual arthropod and assigning taxonomic labels to over 17 000 specimens. Combining the taxonomic resolution of DNA barcodes with the precise abundance estimates obtainable from bulk images holds great potential for rapid, large-scale characterization of insect communities. This dataset pushes the boundaries of tiny object detection and instance segmentation, fostering innovation in both ecological and machine learning research.
format Preprint
id arxiv_https___arxiv_org_abs_2507_06972
institution arXiv
publishDate 2025
record_format arxiv
title A multi-modal dataset for insect biodiversity with imagery and DNA at the trap and individual level
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2507.06972