Bibliographic Details
Main Authors: Gallego, Fernando; López-García, Guillermo; Gasco-Sánchez, Luis; Krallinger, Martin; Veredas, Francisco J.
Format: Preprint
Published: 2024
Subjects: Computation and Language
Online Access: https://arxiv.org/abs/2404.06367
author Gallego, Fernando
López-García, Guillermo
Gasco-Sánchez, Luis
Krallinger, Martin
Veredas, Francisco J.
contents Advances in natural language processing techniques, such as named entity recognition and normalization to widely used standardized terminologies like UMLS or SNOMED-CT, along with the digitalization of electronic health records, have significantly advanced clinical text analysis. This study presents ClinLinker, a novel approach employing a two-phase pipeline for medical entity linking that leverages the potential of in-domain adapted language models for biomedical text mining: initial candidate retrieval using a SapBERT-based bi-encoder and subsequent re-ranking with a cross-encoder, trained following a contrastive-learning strategy to be tailored to medical concepts in Spanish. This methodology, focused initially on content in Spanish, substantially outperforms multilingual language models designed for the same purpose. This holds even in complex scenarios involving heterogeneous medical terminologies and when training on a subset of the original data. Our results, evaluated using top-k accuracy at 25 and other top-k metrics, demonstrate our approach's performance on two distinct clinical entity linking Gold Standard corpora, DisTEMIST (diseases) and MedProcNER (clinical procedures), outperforming previous benchmarks by 40 points in DisTEMIST and 43 points in MedProcNER, both normalized to SNOMED-CT codes. These findings highlight our approach's ability to address language-specific nuances and set a new benchmark in entity linking, offering a potent tool for enhancing the utility of digital medical records. The resulting system is of practical value both for large-scale automatic generation of structured data derived from clinical records and for exhaustive extraction and harmonization of predefined clinical variables of interest.
format Preprint
id arxiv_https___arxiv_org_abs_2404_06367
institution arXiv
publishDate 2024
record_format arxiv
title ClinLinker: Medical Entity Linking of Clinical Concept Mentions in Spanish
topic Computation and Language
url https://arxiv.org/abs/2404.06367