Bibliographic Details
Main Authors: Wang, Ziteng, He, Yujie, Li, Guanliang, Yang, Siqi, Xiong, Jiaqi, Liu, Songxiang
Format: Preprint
Published: 2026
Subjects:
Online Access: https://arxiv.org/abs/2601.04897
contents Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated impressive performance on standard visual reasoning benchmarks. However, there is growing concern that these models rely excessively on linguistic shortcuts rather than genuine visual grounding, a phenomenon we term Text Bias. In this paper, we investigate the fundamental tension between visual perception and linguistic priors. We decouple the sources of this bias into two dimensions: Internal Corpus Bias, stemming from statistical correlations in pretraining, and External Instruction Bias, arising from the alignment-induced tendency toward sycophancy. To quantify this effect, we introduce V-FAT (Visual Fidelity Against Text-bias), a diagnostic benchmark comprising 4,026 VQA instances across six semantic domains. V-FAT employs a Three-Level Evaluation Framework that systematically increases the conflict between visual evidence and textual information: (L1) internal bias from atypical images, (L2) external bias from misleading instructions, and (L3) synergistic bias where both coincide. We introduce the Visual Robustness Score (VRS), a metric designed to penalize "lucky" linguistic guesses and reward true visual fidelity. Our evaluation of 12 frontier MLLMs reveals that while models excel in existing benchmarks, they experience significant visual collapse under high linguistic dominance.
format Preprint
id arxiv_https___arxiv_org_abs_2601_04897
institution arXiv
publishDate 2026
record_format arxiv
title V-FAT: Benchmarking Visual Fidelity Against Text-bias
topic Computation and Language
Computer Vision and Pattern Recognition
Machine Learning
Multimedia