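Staff View: Automatic Extraction of Rules for Generating Synthetic Patient Data From Real-World Population Data Using Glioblastoma as an Example :: Library Catalog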

Bibliographic Details
Main Authors:	Appenzeller, Arno; Terzer, Nick; Homeyer, André; Redlich, Jan-Philipp; Luttmann, Sabine; Feuerhake, Friedrich; Schaadt, Nadine S.; Intemann, Timm; Teuber-Hanselmann, Sarah; Nikolin, Stefan; Weis, Joachim; Kraywinkel, Klaus; Birnstill, Pascal
Format:	Preprint
Published:	2025
Subjects:	Machine Learning; I.6; J.3
Online Access:	https://arxiv.org/abs/2512.14721

_version_	1866909968250175488
author	Appenzeller, Arno; Terzer, Nick; Homeyer, André; Redlich, Jan-Philipp; Luttmann, Sabine; Feuerhake, Friedrich; Schaadt, Nadine S.; Intemann, Timm; Teuber-Hanselmann, Sarah; Nikolin, Stefan; Weis, Joachim; Kraywinkel, Klaus; Birnstill, Pascal
author_facet	Appenzeller, Arno; Terzer, Nick; Homeyer, André; Redlich, Jan-Philipp; Luttmann, Sabine; Feuerhake, Friedrich; Schaadt, Nadine S.; Intemann, Timm; Teuber-Hanselmann, Sarah; Nikolin, Stefan; Weis, Joachim; Kraywinkel, Klaus; Birnstill, Pascal
contents	The generation of synthetic data is a promising technology to make medical data available for secondary use in a privacy-compliant manner. A popular method for creating realistic patient data is the rule-based Synthea data generator. Synthea generates data based on rules describing the lifetime of a synthetic patient. These rules typically express the probability of a condition occurring, such as a disease, depending on factors like age. Since they only contain statistical information, rules usually have no specific data protection requirements. However, creating meaningful rules can be a very complex process that requires expert knowledge and realistic sample data. In this paper, we introduce and evaluate an approach to automatically generate Synthea rules based on statistics from tabular data, which we extracted from cancer reports. As an example use case, we created a Synthea module for glioblastoma from a real-world dataset and used it to generate a synthetic dataset. Compared to the original dataset, the synthetic data reproduced known disease courses and mostly retained the statistical properties. Overall, synthetic patient data holds great potential for privacy-preserving research. The data can be used to formulate hypotheses and to develop prototypes, but medical interpretation should consider the specific limitations as with any currently available approach.
format	Preprint
id	arxiv_https___arxiv_org_abs_2512_14721
institution	arXiv
publishDate	2025
record_format	arxiv
title	Automatic Extraction of Rules for Generating Synthetic Patient Data From Real-World Population Data Using Glioblastoma as an Example
topic	Machine Learning; I.6; J.3
url	https://arxiv.org/abs/2512.14721
