Bibliographic Details
Main Authors: Liu, Qing'an; Feng, Juntong; Wang, Yuhao; Han, Xinzhe; Cheng, Yujie; Zhu, Yue; Diao, Haiwen; Zhuge, Yunzhi; Lu, Huchuan
Format: Preprint
Published: 2026
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2602.04802
author Liu, Qing'an
Feng, Juntong
Wang, Yuhao
Han, Xinzhe
Cheng, Yujie
Zhu, Yue
Diao, Haiwen
Zhuge, Yunzhi
Lu, Huchuan
contents Vision-Language Models (VLMs) have achieved impressive performance in cross-modal understanding across textual and visual inputs, yet existing benchmarks predominantly focus on pure-text queries. In real-world scenarios, language also frequently appears as visualized text embedded in images, raising the question of whether current VLMs handle such inputs comparably. We introduce VISTA-Bench, a systematic benchmark spanning multimodal perception, reasoning, and unimodal understanding domains. It evaluates visualized text understanding by contrasting pure-text and visualized-text questions under controlled rendering conditions. Extensive evaluation of over 30 representative VLMs reveals a pronounced modality gap: models that perform well on pure-text queries often degrade substantially when semantically equivalent content is presented as visualized text. This gap widens further as perceptual difficulty increases, highlighting sensitivity to rendering variations despite unchanged semantics. Overall, VISTA-Bench provides a principled evaluation framework to diagnose this limitation and to guide progress toward more unified language representations across tokenized text and pixels.
format Preprint
id arxiv_https___arxiv_org_abs_2602_04802
institution arXiv
publishDate 2026
record_format arxiv
title VISTA-Bench: Do Vision-Language Models Really Understand Visualized Text as Well as Pure Text?
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2602.04802