Staff View: Do Images Speak Louder than Words? Investigating the Effect of Textual Misinformation in VLMs :: Library Catalog

Bibliographic Details
Main Authors:	Zhang, Chi; Ding, Wenxuan; Liu, Jiale; Wu, Mingrui; Wu, Qingyun; Mooney, Ray
Format:	Preprint
Published:	2026
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2601.19202

_version_	1866910001938825216
author	Zhang, Chi; Ding, Wenxuan; Liu, Jiale; Wu, Mingrui; Wu, Qingyun; Mooney, Ray
author_facet	Zhang, Chi; Ding, Wenxuan; Liu, Jiale; Wu, Mingrui; Wu, Qingyun; Mooney, Ray
contents	Vision-Language Models (VLMs) have shown strong multimodal reasoning capabilities on Visual-Question-Answering (VQA) benchmarks. However, their robustness against textual misinformation remains under-explored. While existing research has studied the effect of misinformation in text-only domains, it is not clear how VLMs arbitrate between contradictory information from different modalities. To bridge the gap, we first propose the CONTEXT-VQA (i.e., Conflicting Text) dataset, consisting of image-question pairs together with systematically generated persuasive prompts that deliberately conflict with visual evidence. Then, a thorough evaluation framework is designed and executed to benchmark the susceptibility of various models to these conflicting multimodal inputs. Comprehensive experiments over 11 state-of-the-art VLMs reveal that these models are indeed vulnerable to misleading textual prompts, often overriding clear visual evidence in favor of the conflicting text, and show an average performance drop of over 48.2% after only one round of persuasive conversation. Our findings highlight a critical limitation in current VLMs and underscore the need for improved robustness against textual manipulation.
format	Preprint
id	arxiv_https___arxiv_org_abs_2601_19202
institution	arXiv
publishDate	2026
record_format	arxiv
title	Do Images Speak Louder than Words? Investigating the Effect of Textual Misinformation in VLMs
topic	Computation and Language
url	https://arxiv.org/abs/2601.19202

Similar Items