Staff View: InspectVLM: Unified in Theory, Unreliable in Practice :: Library Catalog

Bibliographic Details
Main Authors:	Wallace, Conor; Corley, Isaac; Lwowski, Jonathan
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2508.01921

_version_	1866908477180346368
author	Wallace, Conor; Corley, Isaac; Lwowski, Jonathan
author_facet	Wallace, Conor; Corley, Isaac; Lwowski, Jonathan
contents	Unified vision-language models (VLMs) promise to streamline computer vision pipelines by reframing multiple visual tasks such as classification, detection, and keypoint localization within a single language-driven interface. This architecture is particularly appealing in industrial inspection, where managing disjoint task-specific models introduces complexity, inefficiency, and maintenance overhead. In this paper, we critically evaluate the viability of this unified paradigm using InspectVLM, a Florence-2-based VLM trained on InspectMM, our new large-scale multimodal, multitask inspection dataset. While InspectVLM performs competitively on image-level classification and structured keypoint tasks, we find that it fails to match traditional ResNet-based models in core inspection metrics. Notably, the model exhibits brittle behavior under low prompt variability, produces degenerate outputs for fine-grained object detection, and frequently defaults to memorized language responses regardless of visual input. Our findings suggest that while language-driven unification offers conceptual elegance, current VLMs lack the visual grounding and robustness necessary for deployment in precision-critical industrial inspections.
format	Preprint
id	arxiv_https___arxiv_org_abs_2508_01921
institution	arXiv
publishDate	2025
record_format	arxiv
title	InspectVLM: Unified in Theory, Unreliable in Practice
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2508.01921

Similar Items