Bibliographic Details
Main Authors: Jacobs, Niklas, Voelkle, Manuel C., Kathmann, Norbert, Hilbert, Kevin
Format: Preprint
Published: 2026
Subjects:
Online Access: https://arxiv.org/abs/2601.06159
_version_ 1866914244595810304
author Jacobs, Niklas
Voelkle, Manuel C.
Kathmann, Norbert
Hilbert, Kevin
contents In the context of personalized medicine, machine learning algorithms are growing in popularity. These algorithms require substantial amounts of information, which can be acquired effectively by reusing previously gathered data. Open data and data synthesis techniques have been proposed to address this need. In this paper, we propose and evaluate an alternative approach that uses additional simulated data based on summary statistics published in the literature. The simulated data are used to pretrain random forests, which are afterwards fine-tuned on a real dataset. We compare the predictive performance of the new approach to that of random forests trained only on the real data. A Monte Carlo Cross Validation (MCCV) framework with 100 iterations was employed to investigate the significance and stability of the results. Since a first study yielded inconclusive results, a second study with improved methodology (i.e., systematic information extraction and a different prediction outcome) was conducted. In Study 1, some pretrained random forests descriptively outperformed the standard random forest. However, this improvement was not significant (t(99) = 0.89, p = 0.19). Contrary to expectations, in Study 2 the random forest trained only on the real data outperformed the pretrained random forests. We conclude with a discussion of challenges, such as the scarcity of informative publications, and recommendations for future research.
format Preprint
id arxiv_https___arxiv_org_abs_2601_06159
institution arXiv
publishDate 2026
record_format arxiv
title Can we Improve Prediction of Psychotherapy Outcomes Through Pretraining With Simulated Data?
topic Machine Learning
url https://arxiv.org/abs/2601.06159
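
The evaluation scheme named in the abstract (Monte Carlo cross-validation with 100 iterations, followed by a t-test with 99 degrees of freedom) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function names `mccv_splits` and `paired_t`, the 20% test fraction, and the seed are hypothetical, and reading the reported test as a paired comparison across iterations is an assumption that is merely consistent with df = 99.

```python
import math
import random
import statistics


def mccv_splits(n_samples, test_fraction=0.2, n_iterations=100, seed=0):
    """Yield (train_indices, test_indices) for Monte Carlo cross-validation.

    Every iteration draws a fresh random partition, so test sets may
    overlap across iterations (unlike k-fold cross-validation).
    The 20% test fraction is an assumed value, not taken from the paper.
    """
    rng = random.Random(seed)
    n_test = max(1, round(n_samples * test_fraction))
    indices = list(range(n_samples))
    for _ in range(n_iterations):
        shuffled = indices[:]
        rng.shuffle(shuffled)
        yield shuffled[n_test:], shuffled[:n_test]


def paired_t(scores_a, scores_b):
    """Return (t, df) for a paired t-test on per-iteration scores."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample SD (n - 1 denominator)
    return mean_diff / (sd_diff / math.sqrt(n)), n - 1
```

In use, each split would be passed to both models (pretrained and real-data-only), the two per-iteration performance scores collected in two lists of length 100, and `paired_t` applied to those lists, which yields 99 degrees of freedom.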