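Staff View: Designing UNICORN: a Unified Benchmark for Imaging in Computational Pathology, Radiology, and Natural Language :: Library Catalog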

Bibliographic Details
Main Authors:	Stegeman, Michelle; Philipp, Lena; van der Graaf, Fennie; D'Amato, Marina; Grisi, Clément; Builtjes, Luc; Bosma, Joeran S.; Lefkes, Judith; Weber, Rianne A.; Meakin, James A.; Koopman, Thomas; Mickan, Anne; Prokop, Mathias; Smit, Ewoud J.; Litjens, Geert; van der Laak, Jeroen; van Ginneken, Bram; de Rooij, Maarten; Huisman, Henkjan; Jacobs, Colin; Ciompi, Francesco; Hering, Alessa
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition I.4.0
Online Access:	https://arxiv.org/abs/2603.02790

_version_	1866914365032103936
author	Stegeman, Michelle; Philipp, Lena; van der Graaf, Fennie; D'Amato, Marina; Grisi, Clément; Builtjes, Luc; Bosma, Joeran S.; Lefkes, Judith; Weber, Rianne A.; Meakin, James A.; Koopman, Thomas; Mickan, Anne; Prokop, Mathias; Smit, Ewoud J.; Litjens, Geert; van der Laak, Jeroen; van Ginneken, Bram; de Rooij, Maarten; Huisman, Henkjan; Jacobs, Colin; Ciompi, Francesco; Hering, Alessa
author_facet	Stegeman, Michelle; Philipp, Lena; van der Graaf, Fennie; D'Amato, Marina; Grisi, Clément; Builtjes, Luc; Bosma, Joeran S.; Lefkes, Judith; Weber, Rianne A.; Meakin, James A.; Koopman, Thomas; Mickan, Anne; Prokop, Mathias; Smit, Ewoud J.; Litjens, Geert; van der Laak, Jeroen; van Ginneken, Bram; de Rooij, Maarten; Huisman, Henkjan; Jacobs, Colin; Ciompi, Francesco; Hering, Alessa
contents	Medical foundation models show promise to learn broadly generalizable features from large, diverse datasets. This could be the base for reliable cross-modality generalization and rapid adaptation to new, task-specific goals, with only a few task-specific examples. Yet, evidence for this is limited by the lack of public, standardized, and reproducible evaluation frameworks, as existing public benchmarks are often fragmented across task-, organ-, or modality-specific settings, limiting assessment of cross-task generalization. We introduce UNICORN, a public benchmark designed to systematically evaluate medical foundation models under a unified protocol. To isolate representation quality, we built the benchmark on a novel two-step framework that decouples model inference from task-specific evaluation based on standardized few-shot adaptation. As a central design choice, we constructed indirectly accessible sequestered test sets derived from clinically relevant cohorts, along with standardized evaluation code and a submission interface on an open benchmarking platform. Performance is aggregated into a single UNICORN Score, a new metric that we introduce to support direct comparison of foundation models across diverse medical domains, modalities, and task types. The UNICORN test dataset includes data from more than 2,400 patients, including over 3,700 vision cases and over 2,400 clinical reports collected from 17 institutions across eight countries. The benchmark spans eight anatomical regions and four imaging modalities. Both task-specific and aggregated leaderboards enable accessible, standardized, and reproducible evaluation. By standardizing multi-task, multi-modality assessment, UNICORN establishes a foundation for reproducible benchmarking of medical foundation models. Data, baseline methods, and the evaluation platform are publicly available via unicorn.grand-challenge.org.
format	Preprint
id	arxiv_https___arxiv_org_abs_2603_02790
institution	arXiv
publishDate	2026
record_format	arxiv
title	Designing UNICORN: a Unified Benchmark for Imaging in Computational Pathology, Radiology, and Natural Language
topic	Computer Vision and Pattern Recognition I.4.0
url	https://arxiv.org/abs/2603.02790

Similar Items