Salvato in:
Dettagli Bibliografici
Autori principali: Hong, Gibong, Hindle, Veronica, Veasley, Nadine M., Holscher, Hannah D., Kilicoglu, Halil
Natura: Preprint
Pubblicazione: 2024
Soggetti:
Accesso online:https://arxiv.org/abs/2409.19581
Tags: Aggiungi Tag
Nessun Tag, puoi essere il primo ad aggiungerne!!
_version_ 1866917970560679936
author Hong, Gibong
Hindle, Veronica
Veasley, Nadine M.
Holscher, Hannah D.
Kilicoglu, Halil
author_facet Hong, Gibong
Hindle, Veronica
Veasley, Nadine M.
Holscher, Hannah D.
Kilicoglu, Halil
contents Objective: To develop a corpus annotated for diet-microbiome associations from the biomedical literature and train natural language processing (NLP) models to identify these associations, thereby improving the understanding of their role in health and disease, and supporting personalized nutrition strategies. Materials and Methods: We constructed DiMB-RE, a comprehensive corpus annotated with 15 entity types (e.g., Nutrient, Microorganism) and 13 relation types (e.g., INCREASES, IMPROVES) capturing diet-microbiome associations. We fine-tuned and evaluated state-of-the-art NLP models for named entity, trigger, and relation extraction as well as factuality detection using DiMB-RE. In addition, we benchmarked two generative large language models (GPT-4o-mini and GPT-4o) on a subset of the dataset in zero- and one-shot settings. Results: DiMB-RE consists of 14,450 entities and 4,206 relationships from 165 publications (including 30 full-text Results sections). Fine-tuned NLP models performed reasonably well for named entity recognition (0.800 F1 score), while end-to-end relation extraction performance was modest (0.445 F1). The use of Results section annotations improved relation extraction. The impact of trigger detection was mixed. Generative models showed lower accuracy compared to fine-tuned models. Discussion: To our knowledge, DiMB-RE is the largest and most diverse corpus focusing on diet-microbiome interactions. NLP models fine-tuned on DiMB-RE exhibit lower performance compared to similar corpora, highlighting the complexity of information extraction in this domain. Misclassified entities, missed triggers, and cross-sentence relations are the major sources of relation extraction errors. Conclusions: DiMB-RE can serve as a benchmark corpus for biomedical literature mining. DiMB-RE and the NLP models are available at https://github.com/ScienceNLP-Lab/DiMB-RE.
format Preprint
id arxiv_https___arxiv_org_abs_2409_19581
institution arXiv
publishDate 2024
record_format arxiv
spellingShingle DiMB-RE: Mining the Scientific Literature for Diet-Microbiome Associations
Hong, Gibong
Hindle, Veronica
Veasley, Nadine M.
Holscher, Hannah D.
Kilicoglu, Halil
Computation and Language
Objective: To develop a corpus annotated for diet-microbiome associations from the biomedical literature and train natural language processing (NLP) models to identify these associations, thereby improving the understanding of their role in health and disease, and supporting personalized nutrition strategies. Materials and Methods: We constructed DiMB-RE, a comprehensive corpus annotated with 15 entity types (e.g., Nutrient, Microorganism) and 13 relation types (e.g., INCREASES, IMPROVES) capturing diet-microbiome associations. We fine-tuned and evaluated state-of-the-art NLP models for named entity, trigger, and relation extraction as well as factuality detection using DiMB-RE. In addition, we benchmarked two generative large language models (GPT-4o-mini and GPT-4o) on a subset of the dataset in zero- and one-shot settings. Results: DiMB-RE consists of 14,450 entities and 4,206 relationships from 165 publications (including 30 full-text Results sections). Fine-tuned NLP models performed reasonably well for named entity recognition (0.800 F1 score), while end-to-end relation extraction performance was modest (0.445 F1). The use of Results section annotations improved relation extraction. The impact of trigger detection was mixed. Generative models showed lower accuracy compared to fine-tuned models. Discussion: To our knowledge, DiMB-RE is the largest and most diverse corpus focusing on diet-microbiome interactions. NLP models fine-tuned on DiMB-RE exhibit lower performance compared to similar corpora, highlighting the complexity of information extraction in this domain. Misclassified entities, missed triggers, and cross-sentence relations are the major sources of relation extraction errors. Conclusions: DiMB-RE can serve as a benchmark corpus for biomedical literature mining. DiMB-RE and the NLP models are available at https://github.com/ScienceNLP-Lab/DiMB-RE.
title DiMB-RE: Mining the Scientific Literature for Diet-Microbiome Associations
topic Computation and Language
url https://arxiv.org/abs/2409.19581