Internformat: :: Library Catalog

Gespeichert in:

Bibliographische Detailangaben
Hauptverfasser:	Jacobs, Niklas, Voelkle, Manuel C., Kathmann, Norbert, Hilbert, Kevin
Format:	Preprint
Veröffentlicht:	2026
Schlagworte:	Machine Learning
Online-Zugang:	https://arxiv.org/abs/2601.06159
Tags:	Tag hinzufügen Keine Tags, Fügen Sie den ersten Tag hinzu!

_version_	1866914244595810304
author	Jacobs, Niklas Voelkle, Manuel C. Kathmann, Norbert Hilbert, Kevin
author_facet	Jacobs, Niklas Voelkle, Manuel C. Kathmann, Norbert Hilbert, Kevin
contents	In the context of personalized medicine, machine learning algorithms are growing in popularity. These algorithms require substantial information, which can be acquired effectively through the usage of previously gathered data. Open data and the utilization of synthetization techniques have been proposed to address this. In this paper, we propose and evaluate alternative approach that uses additional simulated data based on summary statistics published in the literature. The simulated data are used to pretrain random forests, which are afterwards fine-tuned on a real dataset. We compare the predictive performance of the new approach to random forests trained only on the real data. A Monte Carlo Cross Validation (MCCV) framework with 100 iterations was employed to investigate significance and stability of the results. Since a first study yielded inconclusive results, a second study with improved methodology (i.e., systematic information extraction and different prediction outcome) was conducted. In Study 1, some pretrained random forests descriptively outperformed the standard random forest. However, this improvement was not significant (t(99) = 0.89, p = 0.19). Contrary to expectations, in Study 2 the random forest trained only with the real data outperformed the pretrained random forests. We conclude with a discussion of challenges, such as the scarcity of informative publications, and recommendations for future research.
format	Preprint
id	arxiv_https___arxiv_org_abs_2601_06159
institution	arXiv
publishDate	2026
record_format	arxiv
spellingShingle	Can we Improve Prediction of Psychotherapy Outcomes Through Pretraining With Simulated Data? Jacobs, Niklas Voelkle, Manuel C. Kathmann, Norbert Hilbert, Kevin Machine Learning In the context of personalized medicine, machine learning algorithms are growing in popularity. These algorithms require substantial information, which can be acquired effectively through the usage of previously gathered data. Open data and the utilization of synthetization techniques have been proposed to address this. In this paper, we propose and evaluate alternative approach that uses additional simulated data based on summary statistics published in the literature. The simulated data are used to pretrain random forests, which are afterwards fine-tuned on a real dataset. We compare the predictive performance of the new approach to random forests trained only on the real data. A Monte Carlo Cross Validation (MCCV) framework with 100 iterations was employed to investigate significance and stability of the results. Since a first study yielded inconclusive results, a second study with improved methodology (i.e., systematic information extraction and different prediction outcome) was conducted. In Study 1, some pretrained random forests descriptively outperformed the standard random forest. However, this improvement was not significant (t(99) = 0.89, p = 0.19). Contrary to expectations, in Study 2 the random forest trained only with the real data outperformed the pretrained random forests. We conclude with a discussion of challenges, such as the scarcity of informative publications, and recommendations for future research.
title	Can we Improve Prediction of Psychotherapy Outcomes Through Pretraining With Simulated Data?
topic	Machine Learning
url	https://arxiv.org/abs/2601.06159

Ähnliche Einträge