Bibliographic Details
Main Authors: Stegeman, Michelle; Philipp, Lena; van der Graaf, Fennie; D'Amato, Marina; Grisi, Clément; Builtjes, Luc; Bosma, Joeran S.; Lefkes, Judith; Weber, Rianne A.; Meakin, James A.; Koopman, Thomas; Mickan, Anne; Prokop, Mathias; Smit, Ewoud J.; Litjens, Geert; van der Laak, Jeroen; van Ginneken, Bram; de Rooij, Maarten; Huisman, Henkjan; Jacobs, Colin; Ciompi, Francesco; Hering, Alessa
Format: Preprint
Published: 2026
Online Access: https://arxiv.org/abs/2603.02790
contents Medical foundation models show promise for learning broadly generalizable features from large, diverse datasets. Such features could form the basis for reliable cross-modality generalization and rapid adaptation to new tasks from only a few task-specific examples. Yet evidence for this is limited by the lack of public, standardized, and reproducible evaluation frameworks: existing public benchmarks are often fragmented across task-, organ-, or modality-specific settings, which limits assessment of cross-task generalization. We introduce UNICORN, a public benchmark designed to systematically evaluate medical foundation models under a unified protocol. To isolate representation quality, we built the benchmark on a novel two-step framework that decouples model inference from task-specific evaluation based on standardized few-shot adaptation. As a central design choice, we constructed indirectly accessible sequestered test sets derived from clinically relevant cohorts, along with standardized evaluation code and a submission interface on an open benchmarking platform. Performance is aggregated into a single UNICORN Score, a new metric that supports direct comparison of foundation models across diverse medical domains, modalities, and task types. The UNICORN test dataset comprises data from more than 2,400 patients, with over 3,700 vision cases and over 2,400 clinical reports collected from 17 institutions across eight countries. The benchmark spans eight anatomical regions and four imaging modalities. Both task-specific and aggregated leaderboards enable accessible, standardized, and reproducible evaluation. By standardizing multi-task, multi-modality assessment, UNICORN establishes a foundation for reproducible benchmarking of medical foundation models. Data, baseline methods, and the evaluation platform are publicly available via unicorn.grand-challenge.org.
id arxiv_https___arxiv_org_abs_2603_02790
institution arXiv
record_format arxiv
title Designing UNICORN: a Unified Benchmark for Imaging in Computational Pathology, Radiology, and Natural Language
topic Computer Vision and Pattern Recognition
I.4.0