Bibliographic Details
Main Authors: Appenzeller, Arno, Terzer, Nick, Homeyer, André, Redlich, Jan-Philipp, Luttmann, Sabine, Feuerhake, Friedrich, Schaadt, Nadine S., Intemann, Timm, Teuber-Hanselmann, Sarah, Nikolin, Stefan, Weis, Joachim, Kraywinkel, Klaus, Birnstill, Pascal
Format: Preprint
Published: 2025
Subjects: Machine Learning; I.6; J.3
Online Access: https://arxiv.org/abs/2512.14721
author Appenzeller, Arno
Terzer, Nick
Homeyer, André
Redlich, Jan-Philipp
Luttmann, Sabine
Feuerhake, Friedrich
Schaadt, Nadine S.
Intemann, Timm
Teuber-Hanselmann, Sarah
Nikolin, Stefan
Weis, Joachim
Kraywinkel, Klaus
Birnstill, Pascal
contents Synthetic data generation is a promising way to make medical data available for secondary use in a privacy-compliant manner. A popular method for creating realistic patient data is the rule-based Synthea data generator, which generates data from rules describing the lifetime of a synthetic patient. These rules typically express the probability that a condition, such as a disease, occurs depending on factors like age. Because they contain only statistical information, rules usually carry no specific data protection requirements. However, creating meaningful rules can be complex and requires expert knowledge and realistic sample data. In this paper, we introduce and evaluate an approach that automatically generates Synthea rules from statistics of tabular data, which we extracted from cancer reports. As an example use case, we created a Synthea module for glioblastoma from a real-world dataset and used it to generate a synthetic dataset. Compared to the original dataset, the synthetic data reproduced known disease courses and mostly retained the statistical properties. Overall, synthetic patient data holds great potential for privacy-preserving research. The data can be used to formulate hypotheses and develop prototypes, but, as with any currently available approach, medical interpretation should take its specific limitations into account.
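The abstract says that rules express the probability of a condition depending on factors like age, and that such rules can be derived from statistics of tabular data. The following is a minimal sketch of that idea, not the authors' method. The rows are made-up toy values, not data from the paper. The column names (`age`, `has_condition`), the age bands, the helper functions, and the state names `Onset` and `No_Onset` are all hypothetical. The JSON field names (`complex_transition`-style entries with `condition`, `distributions`, `transition`) follow one reading of Synthea's Generic Module Framework and should be checked against its documentation before use.

```python
import json

def age_band_probabilities(rows, bands):
    """Estimate P(condition | age band) as a relative frequency per band."""
    probs = {}
    for low, high in bands:
        in_band = [r for r in rows if low <= r["age"] < high]
        if not in_band:
            continue  # no observations: emit no rule for this band
        cases = sum(1 for r in in_band if r["has_condition"])
        probs[(low, high)] = cases / len(in_band)
    return probs

def complex_transition(probs, onset_state, fallback_state):
    """Build a Synthea-style transition list with one age condition per band.

    Assumption: field names mirror Synthea's Generic Module Framework.
    """
    entries = []
    for (low, high), p in sorted(probs.items()):
        entries.append({
            "condition": {
                "condition_type": "And",
                "conditions": [
                    {"condition_type": "Age", "operator": ">=",
                     "quantity": low, "unit": "years"},
                    {"condition_type": "Age", "operator": "<",
                     "quantity": high, "unit": "years"},
                ],
            },
            "distributions": [
                {"distribution": p, "transition": onset_state},
                {"distribution": 1 - p, "transition": fallback_state},
            ],
        })
    # Entry without a condition: default branch for ages outside all bands.
    entries.append({"transition": fallback_state})
    return entries

# Toy rows (invented for illustration only).
rows = [
    {"age": 45, "has_condition": False},
    {"age": 52, "has_condition": True},
    {"age": 58, "has_condition": False},
    {"age": 49, "has_condition": False},
    {"age": 63, "has_condition": True},
    {"age": 71, "has_condition": False},
]
bands = [(40, 60), (60, 80)]  # hypothetical age bands in years

probs = age_band_probabilities(rows, bands)
transition = complex_transition(probs, "Onset", "No_Onset")
print(json.dumps(transition, indent=2))
```

Only aggregated frequencies end up in the generated rule, which matches the abstract's point that rules contain statistical information rather than individual records. With very small bands, however, a frequency can still reveal information about individuals, so a real pipeline would likely need minimum cell counts or similar safeguards.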
format Preprint
id arxiv_https___arxiv_org_abs_2512_14721
institution arXiv
publishDate 2025
record_format arxiv
title Automatic Extraction of Rules for Generating Synthetic Patient Data From Real-World Population Data Using Glioblastoma as an Example
topic Machine Learning
I.6; J.3
url https://arxiv.org/abs/2512.14721