Staff View: VRIQ: Benchmarking and Analyzing Visual-Reasoning IQ of VLMs :: Library Catalog

Bibliographic Details
Main Authors:	Khezresmaeilzadeh, Tina; Zhong, Jike; Psounis, Konstantinos
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition; Machine Learning
Online Access:	https://arxiv.org/abs/2602.05382

_version_	1866915777621262336
author	Khezresmaeilzadeh, Tina; Zhong, Jike; Psounis, Konstantinos
author_facet	Khezresmaeilzadeh, Tina; Zhong, Jike; Psounis, Konstantinos
contents	Recent progress in Vision Language Models (VLMs) has raised the question of whether they can reliably perform nonverbal reasoning. To this end, we introduce VRIQ (Visual Reasoning IQ), a novel benchmark designed to assess and analyze the visual reasoning ability of VLMs. We evaluate models on two sets of tasks: abstract puzzle-style and natural-image reasoning tasks. We find that on abstract puzzles, performance remains near random with an average accuracy of around 28%, while natural tasks yield better but still weak results with 45% accuracy. We also find that tool-augmented reasoning demonstrates only modest improvements. To uncover the source of this weakness, we introduce diagnostic probes targeting perception and reasoning. Our analysis demonstrates that around 56% of failures arise from perception alone, 43% from both perception and reasoning, and only a mere 1% from reasoning alone. This motivates us to design fine-grained diagnostic probe questions targeting specific perception categories (e.g., shape, count, position, 3D/depth), revealing that certain categories cause more failures than others. Our benchmark and analysis establish that current VLMs, even with visual reasoning tools, remain unreliable abstract reasoners, mostly due to perception limitations, and offer a principled basis for improving visual reasoning in multimodal systems.
format	Preprint
id	arxiv_https___arxiv_org_abs_2602_05382
institution	arXiv
publishDate	2026
record_format	arxiv
title	VRIQ: Benchmarking and Analyzing Visual-Reasoning IQ of VLMs
topic	Computer Vision and Pattern Recognition; Machine Learning
url	https://arxiv.org/abs/2602.05382
