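Staff View: Incorporating Missing Data Considerations into Sample Size Calculations for Developing Clinical Prediction Models :: Library Catalog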

Bibliographic Details
Main Authors:	Martin, Glen P.; Bladon, Sian; Whittle, Rebecca; Wells, Molly; Collins, Gary S.; Riley, Richard D.
Format:	Preprint
Published:	2026
Subjects:	Methodology
Online Access:	https://arxiv.org/abs/2605.07312

_version_	1866911661769621504
author	Martin, Glen P.; Bladon, Sian; Whittle, Rebecca; Wells, Molly; Collins, Gary S.; Riley, Richard D.
author_facet	Martin, Glen P.; Bladon, Sian; Whittle, Rebecca; Wells, Molly; Collins, Gary S.; Riley, Richard D.
contents	Clinical prediction models must be developed using sufficiently large datasets to minimise overfitting and ensure robust predictive performance. Existing sample size calculations assume complete predictor data for all included participants, yet missing values are common and may increase required sample sizes. This study aimed to quantify how missing predictor data and different imputation methods affect overfitting and model degradation, within datasets that adhere to current sample size criteria. We also aimed to explore how a general sample size framework based on anticipated posterior (sampling) distributions can be adapted to incorporate missing data assumptions and handling strategies. Using a simulation study, we found that in development data meeting current minimum sample size requirements, missing data reduced predictive performance, with expected calibration slopes frequently falling below the targeted value of 0.9. Increasing the required sample size to account for missing data reduced overfitting concerns, but the necessary inflation factor was context specific. In some scenarios, up to twice the minimum sample size was needed to achieve performance comparable to models developed using fully observed data. Expected value of perfect information calculations allowed quantification of the expected loss due to finite samples and missingness. Through two applied examples, we illustrate how embedding missing data assumptions and handling within the posterior sampling approach provides a principled way to determine required minimum sample sizes under missing data. Overall, missing predictor data increases minimum sample size requirements to develop stable and well-calibrated models. Our adaptations to recent posterior (sampling) sample size calculations offer a practical approach for incorporating missing data directly into sample size calculations.
format	Preprint
id	arxiv_https___arxiv_org_abs_2605_07312
institution	arXiv
publishDate	2026
record_format	arxiv
title	Incorporating Missing Data Considerations into Sample Size Calculations for Developing Clinical Prediction Models
topic	Methodology
url	https://arxiv.org/abs/2605.07312

Similar Items