Bibliographic Details
Main Authors: Casey, Arlene; Dunbar, Stuart; Gruber, Franz; McInerney, Samuel; Falis, Matúš; Linksted, Pamela; Wilde, Katie; Harrison, Kathy; Hamilton, Alison; Cole, Christian
Format: Preprint
Published: 2025
Online Access: https://arxiv.org/abs/2506.02063
Table of Contents:
  • Clinical free-text data offer immense potential to improve population health research through richer phenotyping, symptom tracking, and contextual understanding of patient care. However, these data present significant privacy risks due to the presence of directly or indirectly identifying information embedded in unstructured narratives. While numerous de-identification tools have been developed, few have been tested on real-world, heterogeneous datasets at scale or assessed for governance readiness. In this paper, we synthesise our findings from previous studies examining the privacy-risk landscape across multiple document types and NHS data providers in Scotland. We characterise how direct and indirect identifiers vary by record type, clinical setting, and data flow, and show how changes in documentation practice can degrade model performance over time. Through public engagement, we explore societal expectations around the safe use of clinical free text and reflect these in the design of a prototype privacy-risk management tool to support transparent, auditable decision-making. Our findings highlight that privacy risk is context-dependent and cumulative, underscoring the need for adaptable, hybrid de-identification approaches that combine rule-based precision with contextual understanding. We offer a comprehensive view of the challenges and opportunities for safe, scalable reuse of clinical free text within Trusted Research Environments and beyond, grounded in both technical evidence and public perspectives on responsible data use.