Bibliographic Details
Main Authors: Pan, Jiazhen; Shen, Weixiang; Li, Jun; Canisius, Julian; Bitzer, Felix; Roßmüller, Paula; Yang, Jiancheng; Kreutzinger, Virginie; Rueckert, Daniel; Wiestler, Benedikt
Format: Preprint
Published: 2026
Subjects:
Online Access: https://arxiv.org/abs/2605.23629
author Pan, Jiazhen
Shen, Weixiang
Li, Jun
Canisius, Julian
Bitzer, Felix
Roßmüller, Paula
Yang, Jiancheng
Kreutzinger, Virginie
Rueckert, Daniel
Wiestler, Benedikt
contents Medical diagnosis is not a single prediction from a fully specified vignette. It is a sequential workup: clinicians decide what evidence to obtain, revise a differential diagnosis, and stop when the diagnosis is sufficiently supported. Most medical AI benchmarks instead reveal the relevant context upfront and score only the final answer, making unsupported correct guesses, premature closure, inefficient workups, and poor uncertainty updating invisible. We introduce DDX-TRACE, a physician-adjudicated benchmark for multimodal neuroradiology that evaluates diagnostic trajectories under hidden evidence over 211 challenging cases. Each case begins with limited clinical history; models request imaging studies in free form, receive matched image bundles when available, update a probabilistic differential diagnosis after each turn, and stop with a localized final diagnosis. Evaluating state-of-the-art VLMs, we find that final diagnosis scores can substantially misrepresent workup quality: models may guess plausible diagnoses without essential evidence, request useful studies but misinterpret raw images, or acquire evidence inefficiently while updating uncertainty poorly. Controlled evidence variants isolate bottlenecks in planning, visual evidence extraction, and downstream differential reasoning. DDX-TRACE shifts medical AI evaluation from final answers to evidence-supported diagnostic trajectories.
format Preprint
id arxiv_https___arxiv_org_abs_2605_23629
institution arXiv
publishDate 2026
record_format arxiv
title DDX-TRACE: A Benchmark for Medical Diagnostic Trajectories in VLMs
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2605.23629