Bibliographic Details
Main Authors: Moreno-Casanova, J., Auñón, J. M., Mártinez-Pérez, A., Pérez-Martínez, M. E., Gas-López, M. E.
Format: Preprint
Published: 2025
Subjects:
Online Access: https://arxiv.org/abs/2505.09794
_version_ 1866908364278071296
author Moreno-Casanova, J.
Auñón, J. M.
Mártinez-Pérez, A.
Pérez-Martínez, M. E.
Gas-López, M. E.
contents Research projects, including those focused on cancer, rely on the manual extraction of information from clinical reports. This process is time-consuming and prone to errors, limiting the efficiency of data-driven approaches in healthcare. To address these challenges, Natural Language Processing (NLP) offers an alternative for automating the extraction of relevant data from electronic health records (EHRs). In this study, we focus on lung and breast cancer due to their high incidence and their significant impact on public health. Early detection and effective data management in both types of cancer are crucial for improving patient outcomes. To enhance the accuracy and efficiency of data extraction, we utilized GMV's NLP tool uQuery, which excels at identifying relevant entities in clinical texts and converting them into standardized formats such as SNOMED and OMOP. uQuery not only detects and classifies entities but also associates them with contextual information, including negated entities, temporal aspects, and patient-related details. In this work, we explore the use of NLP techniques, specifically Named Entity Recognition (NER), to automatically identify and extract key clinical information from EHRs related to these two cancers. A dataset from Health Research Institute Hospital La Fe (IIS La Fe), comprising 200 annotated breast cancer and 400 lung cancer reports, was used, with eight clinical entities manually labeled using the Doccano platform. To perform NER, we fine-tuned the bsc-bio-ehr-en3 model, a RoBERTa-based biomedical language model pre-trained on Spanish text. Fine-tuning was performed using the Transformers architecture, enabling accurate recognition of clinical entities in these cancer types. Our results demonstrate strong overall performance, particularly in identifying entities like MET and PAT, although challenges remain with less frequent entities like EVOL.
format Preprint
id arxiv_https___arxiv_org_abs_2505_09794
institution arXiv
publishDate 2025
record_format arxiv
title Automated Detection of Clinical Entities in Lung and Breast Cancer Reports Using NLP Techniques
topic Computation and Language
Artificial Intelligence
url https://arxiv.org/abs/2505.09794