Staff View: AMVICC: A Novel Benchmark for Cross-Modal Failure Mode Profiling for VLMs and IGMs :: Library Catalog

Bibliographic Details
Main Authors:	Basappa, Aahana; Goel, Pranay; Karra, Anusri; Karra, Anish; Gilmore, Asa; Zhu, Kevin
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2601.17037

_version_	1866914276742004736
author	Basappa, Aahana; Goel, Pranay; Karra, Anusri; Karra, Anish; Gilmore, Asa; Zhu, Kevin
author_facet	Basappa, Aahana; Goel, Pranay; Karra, Anusri; Karra, Anish; Gilmore, Asa; Zhu, Kevin
contents	We investigate the visual reasoning limitations of both multimodal large language models (MLLMs) and image generation models (IGMs) by creating a novel benchmark that systematically compares failure modes across image-to-text and text-to-image tasks, enabling cross-modal evaluation of visual understanding. Despite rapid growth in machine learning, vision language models (VLMs) still fail to understand or generate basic visual concepts such as object orientation, quantity, or spatial relationships, which highlights gaps in elementary visual reasoning. By adapting MMVP benchmark questions into explicit and implicit prompts, we create AMVICC, a novel benchmark for profiling failure modes across modalities. After testing 11 MLLMs and 3 IGMs in nine categories of visual reasoning, our results show that failure modes are often shared between models and modalities, but certain failures are model-specific and modality-specific, and this can potentially be attributed to various factors. IGMs consistently struggled to manipulate specific visual components in response to prompts, especially explicit prompts, suggesting poor control over fine-grained visual attributes. Our findings apply most directly to the evaluation of existing state-of-the-art models on structured visual reasoning tasks. This work lays the foundation for future cross-modal alignment studies, offering a framework to probe whether generation and interpretation failures stem from shared limitations and to guide future improvements in unified vision-language modeling.
format	Preprint
id	arxiv_https___arxiv_org_abs_2601_17037
institution	arXiv
publishDate	2026
record_format	arxiv
title	AMVICC: A Novel Benchmark for Cross-Modal Failure Mode Profiling for VLMs and IGMs
topic	Computer Vision and Pattern Recognition; Artificial Intelligence
url	https://arxiv.org/abs/2601.17037

Similar Items