Staff View: CounterCount: A Diagnostic Framework for Counting Bias in Vision Language Models :: Library Catalog

Bibliographic Details
Main Authors:	Alzahrani, Reem; Alshanqiti, Hassan; Hemid, Bushra Bin; Alyafeai, Zaid; Eldesokey, Abdelrahman; Ghanem, Bernard
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2605.17826

_version_	1866917506617180160
author	Alzahrani, Reem; Alshanqiti, Hassan; Hemid, Bushra Bin; Alyafeai, Zaid; Eldesokey, Abdelrahman; Ghanem, Bernard
author_facet	Alzahrani, Reem; Alshanqiti, Hassan; Hemid, Bushra Bin; Alyafeai, Zaid; Eldesokey, Abdelrahman; Ghanem, Bernard
contents	Vision-Language Models (VLMs) excel at multimodal reasoning, yet it remains unclear whether their answers are grounded in visual evidence or driven by learned language and world priors. Counting provides a precise testbed: when visual evidence conflicts with canonical object knowledge, a model must rely on the image rather than a prototypical count. We introduce CounterCount, a diagnostic framework for counterfactual counting in VLMs, consisting of paired factual and counterfactual images with edited count-relevant attributes, verified answers, and localized evidence annotations. Evaluating recent VLMs, we find strong performance on factual images but consistent degradation under counterfactual attribute changes, indicating reliance on object-level priors even when contradictory visual evidence is present. Using localized annotations, we show that these failures are not solely due to missing or ambiguous visual evidence, but to models underweighting attention to count-relevant visual tokens. We introduce a unified inference-time attention modulation strategy that reweights selected visual tokens, improving counterfactual counting accuracy by up to 8% across multiple VLMs. Overall, CounterCount exposes prior-driven counting failures and provides diagnostic insights for designing future VLMs.
format	Preprint
id	arxiv_https___arxiv_org_abs_2605_17826
institution	arXiv
publishDate	2026
record_format	arxiv
title	CounterCount: A Diagnostic Framework for Counting Bias in Vision Language Models
topic	Computer Vision and Pattern Recognition; Artificial Intelligence
url	https://arxiv.org/abs/2605.17826
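
The abstract in the contents field mentions an inference-time attention modulation strategy that reweights selected visual tokens. The record gives no implementation details, so the following is only a minimal illustrative sketch of the general idea (boost the attention weights of chosen tokens, then renormalize), not the authors' method. The function name `reweight_attention`, the `boost` parameter, and the toy weights are all hypothetical.

```python
def reweight_attention(weights, selected, boost=2.0):
    """Scale the attention weights at the `selected` indices by `boost`,
    then renormalize so the weights sum to 1 again.

    Hypothetical helper for illustration only; not taken from the paper.
    """
    if boost <= 0:
        raise ValueError("boost must be positive")
    selected = set(selected)
    # Multiply only the chosen (e.g. count-relevant) token weights.
    scaled = [w * boost if i in selected else w for i, w in enumerate(weights)]
    total = sum(scaled)
    if total == 0:
        # Nothing to renormalize; return the input unchanged.
        return list(weights)
    return [w / total for w in scaled]


# Toy attention distribution over four visual tokens (made-up numbers).
attn = [0.4, 0.1, 0.1, 0.4]
# Assume tokens 1 and 2 were marked as count-relevant.
new_attn = reweight_attention(attn, selected=[1, 2], boost=3.0)
```

In this toy case the share of attention on the selected tokens rises while the total stays normalized; how tokens are selected and where in the model the reweighting is applied are left open here because the record does not say.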

Similar Items