Bibliographic details
Main author: Yip, Simon
Format: Digital resource
Language: English
Published: Zenodo 2026
Online access: https://doi.org/10.5281/zenodo.20018309
Table of contents:
  • <p>This work presents a cloud-deployed hybrid clinical NLP system that converts unstructured ICU progress notes into structured, auditable clinical entity outputs for downstream analysis and machine-learning workflows. The pipeline uses a hybrid extraction-validation architecture.</p> <p>Deterministic, section-aware regex rules first extract span-aligned candidate entities from ICU notes across three clinically meaningful categories: <code>SYMPTOM</code>, <code>INTERVENTION</code>, and <code>CLINICAL_CONDITION</code>. This rule-based layer is designed to provide broad candidate coverage, schema control, and exact text provenance. A fine-tuned BioClinicalBERT classifier then validates each candidate in its sentence context, handling contextual ambiguity such as intent, negation, temporality, and uncertainty. The model was fine-tuned on 1,200 manually annotated entity examples, and its decision threshold was tuned to prioritise precision in the final structured outputs.</p> <p>The system was developed on a filtered PhysioNet MIMIC-IV ICU note corpus of 162,296 progress reports across 32,910 ICU stays. Full-corpus execution generated 780,941 candidate entities, of which 319,852 (40.96%) were classified as valid after transformer validation. On an evaluation set, BioClinicalBERT validation raised precision from 0.571 to 0.833 (a 45.9% relative improvement) and cut false positives from 66 to 11 (-83.3%) compared with the rule-based baseline. The stricter filtering lowered recall from 0.989 to 0.618, leaving F1 nearly unchanged (0.724 to 0.710).</p> <p>The final system supports both large-scale offline corpus processing and real-time inference through a stateless FastAPI service, containerised with Docker and deployed on Google Cloud Run. GitHub Actions CI/CD automates reproducible deployment updates.
</p> <p>This work is research-focused. It is not a live clinical decision-support system or a medical device with regulatory validation.</p>
<table> <tbody>
<tr> <td><strong>System</strong></td> <td><strong>Precision</strong></td> <td><strong>Recall</strong></td> <td><strong>F1-Score</strong></td> <td><strong>False Positives</strong></td> <td><strong>Interpretation</strong></td> </tr>
<tr> <td>Rule-based baseline</td> <td>0.571</td> <td>0.989</td> <td>0.724</td> <td>66</td> <td>Broad candidate generation with near-complete recall but high noise</td> </tr>
<tr> <td>BioClinicalBERT validation</td> <td>0.833</td> <td>0.618</td> <td>0.710</td> <td>11</td> <td>Cleaner final outputs with substantially fewer false positives</td> </tr>
<tr> <td>Change (absolute)</td> <td>+0.262</td> <td>-0.371</td> <td>-0.014</td> <td>-55</td> <td>Precision improved substantially; recall loss reflects conservative filtering</td> </tr>
</tbody> </table>
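<p>The reported figures can be cross-checked with a few lines of arithmetic. The block below is a minimal sketch, not code from the described system. The function and variable names are illustrative. The true-positive and gold-entity counts are not reported in the record; they are back-calculated here on the assumption that the published precision and recall values were rounded from integer counts.</p>

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def relative_change(old: float, new: float) -> float:
    """Relative change from old to new, as a percentage."""
    return (new - old) / old * 100


def implied_true_positives(precision: float, false_positives: int) -> int:
    """Back-calculate TP from precision = TP / (TP + FP).

    This is an inference from rounded values, not a reported figure.
    """
    return round(precision * false_positives / (1 - precision))


# Figures as published in the abstract and the results table.
baseline = {"precision": 0.571, "recall": 0.989, "fp": 66}
validated = {"precision": 0.833, "recall": 0.618, "fp": 11}
candidates, retained = 780_941, 319_852

retained_pct = retained / candidates * 100
f1_baseline = f1_score(baseline["precision"], baseline["recall"])
f1_validated = f1_score(validated["precision"], validated["recall"])
precision_gain = relative_change(baseline["precision"], validated["precision"])
fp_change = relative_change(baseline["fp"], validated["fp"])

# Inferred counts (assumption: published metrics are rounded from integers).
tp_baseline = implied_true_positives(baseline["precision"], baseline["fp"])
tp_validated = implied_true_positives(validated["precision"], validated["fp"])
gold_baseline = round(tp_baseline / baseline["recall"])
gold_validated = round(tp_validated / validated["recall"])

print(f"retained after validation: {retained_pct:.2f}%")
print(f"F1 baseline / validated:   {f1_baseline:.3f} / {f1_validated:.3f}")
print(f"relative precision gain:   {precision_gain:+.1f}%")
print(f"false-positive change:     {fp_change:+.1f}%")
print(f"implied TP (inferred):     {tp_baseline} -> {tp_validated}")
print(f"implied gold entities:     {gold_baseline} and {gold_validated}")
```

<p>Under that rounding assumption, both rows imply the same number of gold-standard entities in the evaluation set, so the table is internally consistent. The small F1 difference also shows that validation trades recall for precision rather than improving both.</p>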