Staff View: Distilling Named Entity Recognition Models for Endangered Species from Large Language Models :: Library Catalog

Bibliographic Details
Main Authors:	Atuhurra, Jesse; Dujohn, Seiveright Cargill; Kamigaito, Hidetaka; Shindo, Hiroyuki; Watanabe, Taro
Format:	Preprint
Published:	2024
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2403.15430

_version_	1866910379290918912
author	Atuhurra, Jesse; Dujohn, Seiveright Cargill; Kamigaito, Hidetaka; Shindo, Hiroyuki; Watanabe, Taro
author_facet	Atuhurra, Jesse; Dujohn, Seiveright Cargill; Kamigaito, Hidetaka; Shindo, Hiroyuki; Watanabe, Taro
contents	Natural language processing (NLP) practitioners are leveraging large language models (LLMs) to create structured datasets from semi-structured and unstructured data sources such as patents, papers, and theses, without having domain-specific knowledge. At the same time, ecological experts are searching for a variety of means to preserve biodiversity. To contribute to these efforts, we focused on endangered species and distilled knowledge from GPT-4 through in-context learning. We created datasets for both named entity recognition (NER) and relation extraction (RE) via a two-stage process: 1) we generated synthetic data from GPT-4 for four classes of endangered species; 2) humans verified the factual accuracy of the synthetic data, resulting in gold data. Our novel dataset contains a total of 3.6K sentences, evenly divided between 1.8K NER and 1.8K RE sentences. Because GPT-4 is resource intensive, the constructed dataset was then used to fine-tune both general BERT and domain-specific BERT variants, completing the knowledge distillation process from GPT-4 to BERT. Experiments show that our knowledge transfer approach is effective at creating an NER model suitable for detecting endangered species from texts.
format	Preprint
id	arxiv_https___arxiv_org_abs_2403_15430
institution	arXiv
publishDate	2024
record_format	arxiv
title	Distilling Named Entity Recognition Models for Endangered Species from Large Language Models
topic	Computation and Language
url	https://arxiv.org/abs/2403.15430

Similar Items