Staff View: V-FAT: Benchmarking Visual Fidelity Against Text-bias

Bibliographic Details
Main Authors:	Wang, Ziteng; He, Yujie; Li, Guanliang; Yang, Siqi; Xiong, Jiaqi; Liu, Songxiang
Format:	Preprint
Published:	2026
Subjects:	Computation and Language; Computer Vision and Pattern Recognition; Machine Learning; Multimedia
Online Access:	https://arxiv.org/abs/2601.04897

_version_	1866915716516544512
author	Wang, Ziteng; He, Yujie; Li, Guanliang; Yang, Siqi; Xiong, Jiaqi; Liu, Songxiang
author_facet	Wang, Ziteng; He, Yujie; Li, Guanliang; Yang, Siqi; Xiong, Jiaqi; Liu, Songxiang
contents	Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated impressive performance on standard visual reasoning benchmarks. However, there is growing concern that these models rely excessively on linguistic shortcuts rather than genuine visual grounding, a phenomenon we term Text Bias. In this paper, we investigate the fundamental tension between visual perception and linguistic priors. We decouple the sources of this bias into two dimensions: Internal Corpus Bias, stemming from statistical correlations in pretraining, and External Instruction Bias, arising from the alignment-induced tendency toward sycophancy. To quantify this effect, we introduce V-FAT (Visual Fidelity Against Text-bias), a diagnostic benchmark comprising 4,026 VQA instances across six semantic domains. V-FAT employs a Three-Level Evaluation Framework that systematically increases the conflict between visual evidence and textual information: (L1) internal bias from atypical images, (L2) external bias from misleading instructions, and (L3) synergistic bias where both coincide. We introduce the Visual Robustness Score (VRS), a metric designed to penalize "lucky" linguistic guesses and reward true visual fidelity. Our evaluation of 12 frontier MLLMs reveals that while models excel in existing benchmarks, they experience significant visual collapse under high linguistic dominance.
format	Preprint
id	arxiv_https___arxiv_org_abs_2601_04897
institution	arXiv
publishDate	2026
record_format	arxiv
title	V-FAT: Benchmarking Visual Fidelity Against Text-bias
topic	Computation and Language; Computer Vision and Pattern Recognition; Machine Learning; Multimedia
url	https://arxiv.org/abs/2601.04897
