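Staff View: Performance uncertainty in medical image analysis: a large-scale investigation of confidence intervals :: Library Catalog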

Bibliographic Details
Main Authors:	André, Pascaline; Heitz, Charles; Christodoulou, Evangelia; Reinke, Annika; Sudre, Carole H.; Antonelli, Michela; Godau, Patrick; Cardoso, M. Jorge; Gilson, Antoine; Montcel, Sophie Tezenas du; Varoquaux, Gaël; Maier-Hein, Lena; Colliot, Olivier
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition; Machine Learning
Online Access:	https://arxiv.org/abs/2601.17103

_version_	1866911395708141568
author	André, Pascaline; Heitz, Charles; Christodoulou, Evangelia; Reinke, Annika; Sudre, Carole H.; Antonelli, Michela; Godau, Patrick; Cardoso, M. Jorge; Gilson, Antoine; Montcel, Sophie Tezenas du; Varoquaux, Gaël; Maier-Hein, Lena; Colliot, Olivier
author_facet	André, Pascaline; Heitz, Charles; Christodoulou, Evangelia; Reinke, Annika; Sudre, Carole H.; Antonelli, Michela; Godau, Patrick; Cardoso, M. Jorge; Gilson, Antoine; Montcel, Sophie Tezenas du; Varoquaux, Gaël; Maier-Hein, Lena; Colliot, Olivier
contents	Performance uncertainty quantification is essential for reliable validation and eventual clinical translation of medical imaging artificial intelligence (AI). Confidence intervals (CIs) play a central role in this process by indicating how precise a reported performance estimate is. Yet, due to the limited amount of work examining CI behavior in medical imaging, the community remains largely unaware of how many diverse CI methods exist and how they behave in specific settings. The purpose of this study is to close this gap. To this end, we conducted a large-scale empirical analysis across a total of 24 segmentation and classification tasks, using 19 trained models per task group, a broad spectrum of commonly used performance metrics, multiple aggregation strategies, and several widely adopted CI methods. Reliability (coverage) and precision (width) of each CI method were estimated across all settings to characterize their dependence on study characteristics. Our analysis revealed five principal findings: 1) the sample size required for reliable CIs varies from a few dozen to several thousand cases depending on study parameters; 2) CI behavior is strongly affected by the choice of performance metric; 3) aggregation strategy substantially influences the reliability of CIs, e.g., CIs require more observations for macro than for micro aggregation; 4) the machine learning problem (segmentation versus classification) modulates these effects; 5) the reliability and precision of CI methods differ depending on the use case. These results form key components for the development of future guidelines on reporting performance uncertainty in medical imaging AI.
format	Preprint
id	arxiv_https___arxiv_org_abs_2601_17103
institution	arXiv
publishDate	2026
record_format	arxiv
title	Performance uncertainty in medical image analysis: a large-scale investigation of confidence intervals
topic	Computer Vision and Pattern Recognition; Machine Learning
url	https://arxiv.org/abs/2601.17103
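
The abstract evaluates CI methods by reliability (coverage) and precision (width). As a rough illustration of what those two quantities mean, the sketch below estimates both for a percentile bootstrap CI of a mean per-case score. Everything here is an assumption made for illustration and is not taken from the record: the choice of the percentile bootstrap, the function names, the synthetic Beta-distributed scores, and the sample size of 50 cases. It is not the study's protocol.

```python
import random
import statistics


def percentile_bootstrap_ci(scores, rng, n_resamples=300, alpha=0.05):
    """Percentile bootstrap CI for the mean of per-case scores (illustrative)."""
    n = len(scores)
    # Resample cases with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return means[lo_idx], means[hi_idx]


def coverage_and_width(true_mean, draw_sample, rng, n_trials=100):
    """Simulate repeated studies; return (coverage, mean CI width)."""
    hits, widths = 0, []
    for _ in range(n_trials):
        lo, hi = percentile_bootstrap_ci(draw_sample(), rng)
        hits += lo <= true_mean <= hi  # CI is "reliable" if it contains the truth
        widths.append(hi - lo)
    return hits / n_trials, statistics.fmean(widths)


# Hypothetical setting: 50 test cases with Beta(8, 2) scores (true mean 0.8).
rng = random.Random(0)
n_cases = 50


def draw():
    return [rng.betavariate(8, 2) for _ in range(n_cases)]


coverage, width = coverage_and_width(0.8, draw, rng)
print(f"coverage={coverage:.2f}, mean width={width:.3f}")
```

Repeating the simulation for different values of `n_cases` or a different score distribution shows how coverage and width depend on sample size and on the metric, which is the kind of dependence the abstract describes.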

Similar Items