Staff View: MedSPOT: A Workflow-Aware Sequential Grounding Benchmark for Clinical GUI :: Library Catalog

Bibliographic Details
Main Authors:	Shakeel, Rozain; Ali, Abdul Rahman Mohammad; Mushtaq, Muneeb; Saleem, Tausifa Jan; Ashraf, Tajamul
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2603.19993

_version_	1866911531811209216
author	Shakeel, Rozain; Ali, Abdul Rahman Mohammad; Mushtaq, Muneeb; Saleem, Tausifa Jan; Ashraf, Tajamul
author_facet	Shakeel, Rozain; Ali, Abdul Rahman Mohammad; Mushtaq, Muneeb; Saleem, Tausifa Jan; Ashraf, Tajamul
contents	Despite the rapid progress of Multimodal Large Language Models (MLLMs), their ability to perform reliable visual grounding in high-stakes clinical software environments remains underexplored. Existing GUI benchmarks largely focus on isolated, single-step grounding queries, overlooking the sequential, workflow-driven reasoning required in real-world medical interfaces, where tasks evolve across interdependent steps and dynamic interface states. We introduce MedSPOT, a workflow-aware sequential grounding benchmark for clinical GUI environments. Unlike prior benchmarks that treat grounding as a standalone prediction task, MedSPOT models procedural interaction as a sequence of structured spatial decisions. The benchmark comprises 216 task-driven videos with 597 annotated keyframes, in which each task consists of 2 to 3 interdependent grounding steps within realistic medical workflows. This design captures interface hierarchies, contextual dependencies, and fine-grained spatial precision under evolving conditions. To evaluate procedural robustness, we propose a strict sequential evaluation protocol that terminates task assessment upon the first incorrect grounding prediction, explicitly measuring error propagation in multi-step workflows. We further introduce a comprehensive failure taxonomy, including edge bias, small-target errors, no prediction, near miss, far miss, and toolbar confusion, to enable systematic diagnosis of model behavior in clinical GUI settings. By shifting evaluation from isolated grounding to workflow-aware sequential reasoning, MedSPOT establishes a realistic and safety-critical benchmark for assessing multimodal models in medical software environments. Code and data are available at: https://github.com/Tajamul21/MedSPOT.
format	Preprint
id	arxiv_https___arxiv_org_abs_2603_19993
institution	arXiv
publishDate	2026
record_format	arxiv
title	MedSPOT: A Workflow-Aware Sequential Grounding Benchmark for Clinical GUI
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2603.19993
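
Note on the evaluation protocol described in the contents field: the abstract states that task assessment terminates at the first incorrect grounding prediction. The sketch below illustrates one way such early termination could be scored. It is not taken from the MedSPOT repository. The correctness criterion (a predicted click point falling inside the target bounding box) and the names point_in_box and evaluate_task are assumptions made for illustration only.

```python
from typing import Optional, Sequence, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def point_in_box(point: Optional[Point], box: Box) -> bool:
    """Assumed criterion: the predicted click lies inside the target box."""
    if point is None:  # the model produced no prediction for this step
        return False
    x, y = point
    x_min, y_min, x_max, y_max = box
    return x_min <= x <= x_max and y_min <= y <= y_max


def evaluate_task(
    predictions: Sequence[Optional[Point]],
    targets: Sequence[Box],
) -> Tuple[int, bool]:
    """Return (steps_completed, task_success) with early termination."""
    steps_completed = 0
    for point, box in zip(predictions, targets):
        if not point_in_box(point, box):
            break  # first incorrect step ends the task; later steps are not scored
        steps_completed += 1
    return steps_completed, steps_completed == len(targets)


# Hypothetical three-step task: step 2 misses, so step 3 is never credited
# even though its prediction would be correct in isolation.
targets = [(0, 0, 10, 10), (20, 20, 30, 30), (40, 40, 50, 50)]
predictions = [(5, 5), (100, 100), (45, 45)]
result = evaluate_task(predictions, targets)
```

Under this scoring a task counts as successful only if every step is grounded correctly in order, which is how an early error propagates to the remainder of the workflow.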

Similar Items