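| Main Authors: | Pan, Jiazhen; Shen, Weixiang; Li, Jun; Canisius, Julian; Bitzer, Felix; Roßmüller, Paula; Yang, Jiancheng; Kreutzinger, Virginie; Rueckert, Daniel; Wiestler, Benedikt |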
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | Computer Vision and Pattern Recognition |
| Online Access: | https://arxiv.org/abs/2605.23629 |
| author | Pan, Jiazhen; Shen, Weixiang; Li, Jun; Canisius, Julian; Bitzer, Felix; Roßmüller, Paula; Yang, Jiancheng; Kreutzinger, Virginie; Rueckert, Daniel; Wiestler, Benedikt |
| contents | Medical diagnosis is not a single prediction from a fully specified vignette. It is a sequential workup: clinicians decide what evidence to obtain, revise a differential diagnosis, and stop when the diagnosis is sufficiently supported. Most medical AI benchmarks instead reveal the relevant context upfront and score only the final answer, making unsupported correct guesses, premature closure, inefficient workups, and poor uncertainty updating invisible. We introduce DDX-TRACE, a physician-adjudicated benchmark for multimodal neuroradiology that evaluates diagnostic trajectories under hidden evidence over 211 challenging cases. Each case begins with limited clinical history; models request imaging studies in free form, receive matched image bundles when available, update a probabilistic differential diagnosis after each turn, and stop with a localized final diagnosis. Evaluating state-of-the-art VLMs, we find that final diagnosis scores can substantially misrepresent workup quality: models may guess plausible diagnoses without essential evidence, request useful studies but misinterpret raw images, or acquire evidence inefficiently while updating uncertainty poorly. Controlled evidence variants isolate bottlenecks in planning, visual evidence extraction, and downstream differential reasoning. DDX-TRACE shifts medical AI evaluation from final answers to evidence-supported diagnostic trajectories. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2605_23629 |
| institution | arXiv |
| publishDate | 2026 |
| record_format | arxiv |
| title | DDX-TRACE: A Benchmark for Medical Diagnostic Trajectories in VLMs |
| topic | Computer Vision and Pattern Recognition |
| url | https://arxiv.org/abs/2605.23629 |
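
The abstract describes a turn-based protocol: the model starts from a limited history, requests studies, receives image bundles when available, updates a probabilistic differential after each turn, and stops with a final diagnosis. The sketch below is only an illustration of that loop under stated assumptions, not the benchmark's implementation. The names `run_case`, `Trajectory`, `toy_policy`, the `stop_threshold` rule, and the `max_turns` cap are all hypothetical and do not come from the record.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

Differential = Dict[str, float]
# Each evidence item pairs a requested study with its bundle (None if unavailable).
Evidence = List[Tuple[str, Optional[str]]]
# Hypothetical model interface: returns (next study request or None, raw scores).
Policy = Callable[[str, Evidence], Tuple[Optional[str], Differential]]


def normalize(scores: Differential) -> Differential:
    """Rescale non-negative scores so the differential sums to 1."""
    total = sum(scores.values())
    if total <= 0:
        raise ValueError("differential needs positive total weight")
    return {dx: s / total for dx, s in scores.items()}


@dataclass
class Trajectory:
    requests: List[str] = field(default_factory=list)
    differentials: List[Differential] = field(default_factory=list)
    final_diagnosis: Optional[str] = None


def run_case(history: str, available_studies: Dict[str, str], policy: Policy,
             max_turns: int = 5, stop_threshold: float = 0.8) -> Trajectory:
    """Run one hidden-evidence case and record the whole trajectory."""
    evidence: Evidence = []
    traj = Trajectory()
    for _ in range(max_turns):
        request, scores = policy(history, evidence)
        differential = normalize(scores)
        traj.differentials.append(differential)
        top_dx = max(differential, key=differential.get)
        # Assumed stopping rule: no further request, or a confident top diagnosis.
        if request is None or differential[top_dx] >= stop_threshold:
            traj.final_diagnosis = top_dx
            return traj
        traj.requests.append(request)
        # A request with no matching bundle yields no new image evidence.
        evidence.append((request, available_studies.get(request)))
    if traj.differentials:
        last = traj.differentials[-1]
        traj.final_diagnosis = max(last, key=last.get)
    return traj


def toy_policy(history: str, evidence: Evidence):
    """Stand-in for a VLM: asks for one study, then commits."""
    if not evidence:
        return "MRI brain", {"glioma": 1.0, "abscess": 1.0}
    return None, {"glioma": 9.0, "abscess": 1.0}


traj = run_case("new-onset seizures", {"MRI brain": "bundle-001"}, toy_policy)
print(traj.requests, traj.final_diagnosis)
```

Recording every intermediate differential, rather than only `final_diagnosis`, is what would let a scorer separate a correct guess made without evidence from a diagnosis that the requested studies actually support.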