Staff View: DDX-TRACE: A Benchmark for Medical Diagnostic Trajectories in VLMs :: Library Catalog

Bibliographic Details
Main Authors:	Pan, Jiazhen; Shen, Weixiang; Li, Jun; Canisius, Julian; Bitzer, Felix; Roßmüller, Paula; Yang, Jiancheng; Kreutzinger, Virginie; Rueckert, Daniel; Wiestler, Benedikt
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2605.23629

_version_	1866910248141324288
author	Pan, Jiazhen; Shen, Weixiang; Li, Jun; Canisius, Julian; Bitzer, Felix; Roßmüller, Paula; Yang, Jiancheng; Kreutzinger, Virginie; Rueckert, Daniel; Wiestler, Benedikt
contents	Medical diagnosis is not a single prediction from a fully specified vignette. It is a sequential workup: clinicians decide what evidence to obtain, revise a differential diagnosis, and stop when the diagnosis is sufficiently supported. Most medical AI benchmarks instead reveal the relevant context upfront and score only the final answer, making unsupported correct guesses, premature closure, inefficient workups, and poor uncertainty updating invisible. We introduce DDX-TRACE, a physician-adjudicated benchmark for multimodal neuroradiology that evaluates diagnostic trajectories under hidden evidence over 211 challenging cases. Each case begins with limited clinical history; models request imaging studies in free form, receive matched image bundles when available, update a probabilistic differential diagnosis after each turn, and stop with a localized final diagnosis. Evaluating state-of-the-art VLMs, we find that final diagnosis scores can substantially misrepresent workup quality: models may guess plausible diagnoses without essential evidence, request useful studies but misinterpret raw images, or acquire evidence inefficiently while updating uncertainty poorly. Controlled evidence variants isolate bottlenecks in planning, visual evidence extraction, and downstream differential reasoning. DDX-TRACE shifts medical AI evaluation from final answers to evidence-supported diagnostic trajectories.
format	Preprint
id	arxiv_https___arxiv_org_abs_2605_23629
institution	arXiv
publishDate	2026
record_format	arxiv
title	DDX-TRACE: A Benchmark for Medical Diagnostic Trajectories in VLMs
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2605.23629