Bibliographic Details
Main Authors: Zhou, Doudou; Tong, Han; Wang, Linshanshan; Liu, Suqi; Xiong, Xin; Gan, Ziming; Griffier, Romain; Hejblum, Boris; Liu, Yun-Chung; Hong, Chuan; Bonzel, Clara-Lea; Cai, Tianrun; Pan, Kevin; Ho, Yuk-Lam; Costa, Lauren; Panickan, Vidul A.; Gaziano, J. Michael; Mandl, Kenneth; Jouhet, Vianney; Thiebaut, Rodolphe; Xia, Zongqi; Cho, Kelly; Liao, Katherine; Cai, Tianxi
Format: Preprint
Published: 2025
Online Access: https://arxiv.org/abs/2502.08547
Abstract:
  • The widespread adoption of electronic health records has created new opportunities for translational clinical research, yet this promise remains constrained by fragmented data across privacy-siloed institutions and substantial heterogeneity in local coding practices. While privacy-preserving collaborative learning allows institutions to work together without sharing patient-level data, it does not address inconsistencies in how clinical concepts are represented across sites. We introduce a graph-based framework that addresses this gap by treating data harmonization as a scalable representation learning problem. Rather than relying on fixed standards or manual mappings, the framework integrates institution-specific summary statistics from health records, curated biomedical knowledge graphs, and semantic information derived from large language models to learn a shared semantic space. This joint learning approach aligns diverse, site-specific vocabularies while preserving patient privacy. Evaluated across seven institutions and two languages, the framework provides a robust, data-centric foundation for training and deploying clinical models across heterogeneous healthcare systems.