Bibliographic Details
Main Authors: Chen, Li-Chin, Sheu, Ji-Tian, Chuang, Yuh-Jue
Format: Preprint
Published: 2025
Subjects:
Online Access: https://arxiv.org/abs/2509.07330
author Chen, Li-Chin
Sheu, Ji-Tian
Chuang, Yuh-Jue
contents Demographic attributes are universally present in electronic health records. They are the most widely available information across populations and diseases and serve as vital predictors in clinical risk stratification and treatment decisions. Despite their significance, these attributes are often treated as auxiliary inputs in model design, with limited attention paid to learning their representations. This study explored the development of a General Demographic Pre-trained (GDP) model as a foundation model tailored to demographic attributes, focusing on age and gender. The model is pre-trained and evaluated on datasets with diverse disease and population compositions from different geographic regions. The GDP architecture was explored by examining combinations of ordering approaches and encoding methods that transform tabular demographic inputs into effective latent embeddings. Results demonstrate that GDP can generalize across tasks, diseases, and populations. In terms of architectural composition, sequential ordering substantially improves model performance in discrimination, calibration, and the corresponding information gain at each decision tree split, particularly in diseases where age and gender contribute significantly to risk stratification. Even in datasets where demographic attributes hold relatively low predictive value, GDP enhances their representational importance, increasing their influence in downstream gradient boosting models. The findings suggest that foundation models for tabular demographic attributes offer a promising direction for improving predictive performance in healthcare applications.
format Preprint
id arxiv_https___arxiv_org_abs_2509_07330
institution arXiv
publishDate 2025
record_format arxiv
title General Demographic Foundation Models for Enhancing Predictive Performance Across Diseases and Populations
topic Machine Learning
Artificial Intelligence
url https://arxiv.org/abs/2509.07330