

Bibliographic Details
Main Authors:	Hong, Gibong; Hindle, Veronica; Veasley, Nadine M.; Holscher, Hannah D.; Kilicoglu, Halil
Format:	Preprint
Published:	2024
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2409.19581

_version_	1866917970560679936
author	Hong, Gibong; Hindle, Veronica; Veasley, Nadine M.; Holscher, Hannah D.; Kilicoglu, Halil
author_facet	Hong, Gibong; Hindle, Veronica; Veasley, Nadine M.; Holscher, Hannah D.; Kilicoglu, Halil
contents	Objective: To develop a corpus annotated for diet-microbiome associations from the biomedical literature and train natural language processing (NLP) models to identify these associations, thereby improving the understanding of their role in health and disease, and supporting personalized nutrition strategies. Materials and Methods: We constructed DiMB-RE, a comprehensive corpus annotated with 15 entity types (e.g., Nutrient, Microorganism) and 13 relation types (e.g., INCREASES, IMPROVES) capturing diet-microbiome associations. We fine-tuned and evaluated state-of-the-art NLP models for named entity, trigger, and relation extraction as well as factuality detection using DiMB-RE. In addition, we benchmarked two generative large language models (GPT-4o-mini and GPT-4o) on a subset of the dataset in zero- and one-shot settings. Results: DiMB-RE consists of 14,450 entities and 4,206 relationships from 165 publications (including 30 full-text Results sections). Fine-tuned NLP models performed reasonably well for named entity recognition (0.800 F1 score), while end-to-end relation extraction performance was modest (0.445 F1). The use of Results section annotations improved relation extraction. The impact of trigger detection was mixed. Generative models showed lower accuracy compared to fine-tuned models. Discussion: To our knowledge, DiMB-RE is the largest and most diverse corpus focusing on diet-microbiome interactions. NLP models fine-tuned on DiMB-RE exhibit lower performance compared to similar corpora, highlighting the complexity of information extraction in this domain. Misclassified entities, missed triggers, and cross-sentence relations are the major sources of relation extraction errors. Conclusions: DiMB-RE can serve as a benchmark corpus for biomedical literature mining. DiMB-RE and the NLP models are available at https://github.com/ScienceNLP-Lab/DiMB-RE.
format	Preprint
id	arxiv_https___arxiv_org_abs_2409_19581
institution	arXiv
publishDate	2024
record_format	arxiv
title	DiMB-RE: Mining the Scientific Literature for Diet-Microbiome Associations
topic	Computation and Language
url	https://arxiv.org/abs/2409.19581

Similar Items