Bibliographic Details
Main Authors: Jia, Qi; Yue, Xiang; Huang, Shanshan; Qin, Ziheng; Liu, Yizhu; Lin, Bill Yuchen; You, Yang; Zhai, Guangtao
Format: Preprint
Published: 2024
Subjects: Computation and Language
Online Access: https://arxiv.org/abs/2410.01733
author Jia, Qi
Yue, Xiang
Huang, Shanshan
Qin, Ziheng
Liu, Yizhu
Lin, Bill Yuchen
You, Yang
Zhai, Guangtao
contents Perceiving visual semantics embedded within consecutive characters is a crucial yet under-explored capability for both Large Language Models (LLMs) and Multi-modal Large Language Models (MLLMs). In this work, we select ASCII art as a representative artifact. It depicts concepts through the careful arrangement of characters, and it can be represented in both text and image modalities. We frame the problem as a recognition task and construct a novel benchmark, ASCIIEval. It covers over 3K samples organized under an elaborate categorization tree, along with a training set for further enhancement. Through a comprehensive analysis of tens of models across different input modalities, our benchmark demonstrates its multi-faceted diagnostic power. Given textual input, language models show visual perception ability on ASCII art concepts. Proprietary models achieve over 70% accuracy on certain categories, with GPT-5 ranking first. For image inputs, we reveal that open-source MLLMs suffer from a trade-off between fine-grained text recognition and collective visual perception. They exhibit limited ability to generalize to this special kind of art, leading to a dramatic gap of over 20.01% in accuracy compared with their proprietary counterparts. Another critical finding is that model performance is sensitive to the length of the ASCII art, and this sensitivity varies across input modalities. Unfortunately, none of the models benefited from being given both modalities simultaneously, highlighting the need for more flexible modality-fusion approaches. We also introduce approaches for further enhancement and discuss future directions. Resources are available at https://github.com/JiaQiSJTU/VisionInText.
format Preprint
id arxiv_https___arxiv_org_abs_2410_01733
institution arXiv
publishDate 2024
record_format arxiv
title ASCIIEval: Benchmarking Models' Visual Perception in Text Strings via ASCII Art
topic Computation and Language
url https://arxiv.org/abs/2410.01733