Bibliographic Details
Main Authors: Chuang, Yao-Shun, Sarkar, Atiquer Rahman, Hsu, Yu-Chun, Mohammed, Noman, Jiang, Xiaoqian
Format: Preprint
Published: 2024
Subjects:
Online Access: https://arxiv.org/abs/2407.16166
author Chuang, Yao-Shun
Sarkar, Atiquer Rahman
Hsu, Yu-Chun
Mohammed, Noman
Jiang, Xiaoqian
contents This study examines integrating electronic health records (EHRs) and natural language processing (NLP) with large language models (LLMs) to improve healthcare data management and patient care. It focuses on using advanced models to create secure, HIPAA-compliant synthetic patient notes for biomedical research. The study used de-identified and re-identified MIMIC-III datasets with GPT-3.5, GPT-4, and Mistral 7B to generate synthetic notes. Text generation employed templates and keyword extraction to produce contextually relevant notes, with one-shot generation for comparison. Privacy was assessed by checking for occurrences of protected health information (PHI), while text utility was tested using an ICD-9 coding task. Text quality was evaluated with ROUGE and cosine similarity metrics to measure semantic similarity with source notes. In the PHI occurrence analysis and the ICD-9 coding task, the keyword-based method showed low privacy risk and good performance. One-shot generation showed the highest PHI exposure and PHI co-occurrence, especially in the geographic location and date categories. The Normalized One-shot method achieved the highest classification accuracy. The privacy analysis revealed a critical balance between data utility and privacy protection, with implications for future data use and sharing. Re-identified data consistently outperformed de-identified data. This study demonstrates the effectiveness of keyword-based methods in generating privacy-protecting synthetic clinical notes that retain data usability, potentially transforming clinical data-sharing practices. The superior performance of re-identified over de-identified data suggests a shift towards methods that enhance both utility and privacy by using dummy PHIs to confuse privacy attacks.
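The text-quality metrics named in the abstract can be illustrated with a minimal sketch. The record does not say which ROUGE variant or which vector representation the authors used, so everything below is an assumption made for illustration: ROUGE-1 F1 over lowercase word tokens, and cosine similarity over raw term-count vectors. The function names (tokenize, rouge1_f1, cosine_similarity) and the sample notes are hypothetical and are not taken from the paper.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rouge1_f1(reference, candidate):
    """ROUGE-1 F1: harmonic mean of unigram precision and recall (assumed variant)."""
    ref_counts = Counter(tokenize(reference))
    cand_counts = Counter(tokenize(candidate))
    # Clipped overlap: each token counts at most as often as it appears in both texts.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


def cosine_similarity(text_a, text_b):
    """Cosine similarity between term-count vectors (assumed representation)."""
    vec_a = Counter(tokenize(text_a))
    vec_b = Counter(tokenize(text_b))
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical source note and synthetic note, invented for this example.
source_note = "patient admitted with chest pain"
synthetic_note = "patient reports chest pain"

print(rouge1_f1(source_note, synthetic_note))
print(cosine_similarity(source_note, synthetic_note))
```

Term-count vectors only capture lexical overlap. A study measuring semantic similarity would more plausibly compute cosine similarity over sentence embeddings, which this sketch does not attempt.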
format Preprint
id arxiv_https___arxiv_org_abs_2407_16166
institution arXiv
publishDate 2024
record_format arxiv
title Robust Privacy Amidst Innovation with Large Language Models Through a Critical Assessment of the Risks
topic Computation and Language
url https://arxiv.org/abs/2407.16166