Bibliographic Details
Main Authors: Atuhurra, Jesse, Dujohn, Seiveright Cargill, Kamigaito, Hidetaka, Shindo, Hiroyuki, Watanabe, Taro
Format: Preprint
Published: 2024
Subjects: Computation and Language
Online Access: https://arxiv.org/abs/2403.15430
author Atuhurra, Jesse
Dujohn, Seiveright Cargill
Kamigaito, Hidetaka
Shindo, Hiroyuki
Watanabe, Taro
contents Natural language processing (NLP) practitioners are leveraging large language models (LLMs) to create structured datasets from semi-structured and unstructured data sources such as patents, papers, and theses, without needing domain-specific knowledge. At the same time, ecological experts are searching for ways to preserve biodiversity. To contribute to these efforts, we focused on endangered species and distilled knowledge from GPT-4 through in-context learning. We created datasets for both named entity recognition (NER) and relation extraction (RE) via a two-stage process: 1) we generated synthetic data on four classes of endangered species with GPT-4, and 2) humans verified the factual accuracy of the synthetic data, resulting in gold data. The resulting dataset contains 3.6K sentences, evenly divided into 1.8K NER and 1.8K RE sentences. Because GPT-4 is resource intensive, we then used the dataset to fine-tune both general and domain-specific BERT variants, completing the knowledge distillation from GPT-4 to BERT. Experiments show that our knowledge transfer approach is effective at creating an NER model suitable for detecting endangered species in text.
format Preprint
id arxiv_https___arxiv_org_abs_2403_15430
institution arXiv
publishDate 2024
record_format arxiv
title Distilling Named Entity Recognition Models for Endangered Species from Large Language Models
topic Computation and Language
url https://arxiv.org/abs/2403.15430