Bibliographic Details
Main Authors: Shakeel, Rozain; Ali, Abdul Rahman Mohammad; Mushtaq, Muneeb; Saleem, Tausifa Jan; Ashraf, Tajamul
Format: Preprint
Published: 2026
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2603.19993
_version_ 1866911531811209216
author Shakeel, Rozain
Ali, Abdul Rahman Mohammad
Mushtaq, Muneeb
Saleem, Tausifa Jan
Ashraf, Tajamul
author_facet Shakeel, Rozain
Ali, Abdul Rahman Mohammad
Mushtaq, Muneeb
Saleem, Tausifa Jan
Ashraf, Tajamul
contents Despite the rapid progress of Multimodal Large Language Models (MLLMs), their ability to perform reliable visual grounding in high-stakes clinical software environments remains underexplored. Existing GUI benchmarks largely focus on isolated, single-step grounding queries, overlooking the sequential, workflow-driven reasoning required in real-world medical interfaces, where tasks evolve across interdependent steps and dynamic interface states. We introduce MedSPOT, a workflow-aware sequential grounding benchmark for clinical GUI environments. Unlike prior benchmarks that treat grounding as a standalone prediction task, MedSPOT models procedural interaction as a sequence of structured spatial decisions. The benchmark comprises 216 task-driven videos with 597 annotated keyframes, in which each task consists of 2 to 3 interdependent grounding steps within realistic medical workflows. This design captures interface hierarchies, contextual dependencies, and fine-grained spatial precision under evolving conditions. To evaluate procedural robustness, we propose a strict sequential evaluation protocol that terminates task assessment upon the first incorrect grounding prediction, explicitly measuring error propagation in multi-step workflows. We further introduce a comprehensive failure taxonomy, including edge bias, small-target errors, no prediction, near miss, far miss, and toolbar confusion, to enable systematic diagnosis of model behavior in clinical GUI settings. By shifting evaluation from isolated grounding to workflow-aware sequential reasoning, MedSPOT establishes a realistic and safety-critical benchmark for assessing multimodal models in medical software environments. Code and data are available at: https://github.com/Tajamul21/MedSPOT.
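The strict sequential evaluation protocol described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Step`, `hit`, `steps_completed`, and `evaluate`, the point-in-box correctness criterion, and the two reported metrics are all assumptions made for illustration. The sketch only shows the general idea of stopping a task at its first incorrect grounding prediction so that later steps cannot earn credit.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Sequence, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


@dataclass
class Step:
    """One grounding step (hypothetical structure, not the benchmark's schema)."""
    target: Box                  # annotated bounding box of the UI element
    prediction: Optional[Point]  # predicted click point, or None if no prediction


def hit(step: Step) -> bool:
    """Assumed criterion: the predicted point lies inside the target box."""
    if step.prediction is None:
        return False
    x, y = step.prediction
    x_min, y_min, x_max, y_max = step.target
    return x_min <= x <= x_max and y_min <= y <= y_max


def steps_completed(task: Sequence[Step]) -> int:
    """Count correct steps in order, stopping at the first incorrect one."""
    done = 0
    for step in task:
        if not hit(step):
            break  # strict termination: later steps are never scored
        done += 1
    return done


def evaluate(tasks: Sequence[Sequence[Step]]) -> Dict[str, float]:
    """Aggregate over a non-empty list of non-empty tasks."""
    completed = [steps_completed(task) for task in tasks]
    total_steps = sum(len(task) for task in tasks)
    fully_done = sum(1 for c, task in zip(completed, tasks) if c == len(task))
    return {
        "task_completion_rate": fully_done / len(tasks),
        "sequential_step_accuracy": sum(completed) / total_steps,
    }
```

Under these assumptions, a three-step task whose second prediction misses contributes only one completed step, even if its third prediction would have landed inside the target. That is how early termination exposes error propagation rather than averaging it away.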
format Preprint
id arxiv_https___arxiv_org_abs_2603_19993
institution arXiv
publishDate 2026
record_format arxiv
title MedSPOT: A Workflow-Aware Sequential Grounding Benchmark for Clinical GUI
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2603.19993