Bibliographic Details
Main Authors: Zhou, Doudou; Tong, Han; Wang, Linshanshan; Liu, Suqi; Xiong, Xin; Gan, Ziming; Griffier, Romain; Hejblum, Boris; Liu, Yun-Chung; Hong, Chuan; Bonzel, Clara-Lea; Cai, Tianrun; Pan, Kevin; Ho, Yuk-Lam; Costa, Lauren; Panickan, Vidul A.; Gaziano, J. Michael; Mandl, Kenneth; Jouhet, Vianney; Thiebaut, Rodolphe; Xia, Zongqi; Cho, Kelly; Liao, Katherine; Cai, Tianxi
Format: Preprint
Published: 2025
Subjects: Artificial Intelligence
Online Access: https://arxiv.org/abs/2502.08547
_version_ 1866911565678116864
author Zhou, Doudou
Tong, Han
Wang, Linshanshan
Liu, Suqi
Xiong, Xin
Gan, Ziming
Griffier, Romain
Hejblum, Boris
Liu, Yun-Chung
Hong, Chuan
Bonzel, Clara-Lea
Cai, Tianrun
Pan, Kevin
Ho, Yuk-Lam
Costa, Lauren
Panickan, Vidul A.
Gaziano, J. Michael
Mandl, Kenneth
Jouhet, Vianney
Thiebaut, Rodolphe
Xia, Zongqi
Cho, Kelly
Liao, Katherine
Cai, Tianxi
contents The widespread adoption of electronic health records has created new opportunities for translational clinical research, yet this promise remains constrained by fragmented data across privacy-siloed institutions and substantial heterogeneity in local coding practices. While privacy-preserving collaborative learning allows institutions to work together without sharing patient-level data, it does not address inconsistencies in how clinical concepts are represented across sites. We introduce a graph-based framework that addresses this gap by treating data harmonization as a scalable representation learning problem. Rather than relying on fixed standards or manual mappings, the framework integrates institution-specific summary statistics from health records, curated biomedical knowledge graphs, and semantic information derived from large language models to learn a shared semantic space. This joint learning approach aligns diverse, site-specific vocabularies while preserving patient privacy. Evaluated across seven institutions and two languages, the framework provides a robust, data-centric foundation for training and deploying clinical models across heterogeneous healthcare systems.
format Preprint
id arxiv_https___arxiv_org_abs_2502_08547
institution arXiv
publishDate 2025
record_format arxiv
title Representation learning to advance multi-institutional studies with electronic health record data from US and France
topic Artificial Intelligence
url https://arxiv.org/abs/2502.08547