Staff View: MIRAGE: The Illusion of Visual Understanding :: Library Catalog

Bibliographic Details
Main Authors:	Asadi, Mohammad; O'Sullivan, Jack W.; Cao, Fang; Nedaee, Tahoura; Rajabalifardi, Kamyar; Li, Fei-Fei; Adeli, Ehsan; Ashley, Euan
Format:	Preprint
Published:	2026
Subjects:	Artificial Intelligence
Online Access:	https://arxiv.org/abs/2603.21687

_version_	1866914439073103872
author	Asadi, Mohammad; O'Sullivan, Jack W.; Cao, Fang; Nedaee, Tahoura; Rajabalifardi, Kamyar; Li, Fei-Fei; Adeli, Ehsan; Ashley, Euan
author_facet	Asadi, Mohammad; O'Sullivan, Jack W.; Cao, Fang; Nedaee, Tahoura; Rajabalifardi, Kamyar; Li, Fei-Fei; Adeli, Ehsan; Ashley, Euan
contents	Multimodal AI systems have achieved remarkable performance across a broad range of real-world tasks, yet the mechanisms underlying visual-language reasoning remain surprisingly poorly understood. We report three findings that challenge prevailing assumptions about how these systems process and integrate visual information. First, frontier models readily generate detailed image descriptions and elaborate reasoning traces, including pathology-biased clinical findings, for images never provided; we term this phenomenon mirage reasoning. Second, without any image input, models also attain strikingly high scores across general and medical multimodal benchmarks, bringing into question their utility and design. In the most extreme case, our model achieved the top rank on a standard chest X-ray question-answering benchmark without access to any images. Third, when models were explicitly instructed to guess answers without image access, rather than being implicitly prompted to assume images were present, performance declined markedly. Explicit guessing appears to engage a more conservative response regime, in contrast to the mirage regime in which models behave as though images have been provided. These findings expose fundamental vulnerabilities in how visual-language models reason and are evaluated, pointing to an urgent need for private benchmarks that eliminate textual cues enabling non-visual inference, particularly in medical contexts where miscalibrated AI carries the greatest consequence. We introduce B-Clean as a principled solution for fair, vision-grounded evaluation of multimodal AI systems.
format	Preprint
id	arxiv_https___arxiv_org_abs_2603_21687
institution	arXiv
publishDate	2026
record_format	arxiv
title	MIRAGE: The Illusion of Visual Understanding
topic	Artificial Intelligence
url	https://arxiv.org/abs/2603.21687

Similar Items