Staff View: ASCIIEval: Benchmarking Models' Visual Perception in Text Strings via ASCII Art :: Library Catalog

Bibliographic Details
Main Authors:	Jia, Qi; Yue, Xiang; Huang, Shanshan; Qin, Ziheng; Liu, Yizhu; Lin, Bill Yuchen; You, Yang; Zhai, Guangtao
Format:	Preprint
Published:	2024
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2410.01733

_version_	1866912603778842624
author	Jia, Qi; Yue, Xiang; Huang, Shanshan; Qin, Ziheng; Liu, Yizhu; Lin, Bill Yuchen; You, Yang; Zhai, Guangtao
author_facet	Jia, Qi; Yue, Xiang; Huang, Shanshan; Qin, Ziheng; Liu, Yizhu; Lin, Bill Yuchen; You, Yang; Zhai, Guangtao
contents	Perceiving visual semantics embedded within consecutive characters is a crucial yet under-explored capability for both Large Language Models (LLMs) and Multi-modal Large Language Models (MLLMs). In this work, we select ASCII art as a representative artifact. It depicts concepts through careful arrangement of characters, which can be formulated in both text and image modalities. We frame the problem as a recognition task and construct a novel benchmark, ASCIIEval. It covers over 3K samples with an elaborate categorization tree, along with a training set for further enhancement. Encompassing a comprehensive analysis of tens of models through different input modalities, our benchmark demonstrates its multi-faceted diagnostic power. Given textual input, language models show their visual perception ability on ASCII art concepts. Proprietary models achieve over 70% accuracy on certain categories, with GPT-5 topping the ranking. For image inputs, we reveal that open-source MLLMs suffer from a trade-off between fine-grained text recognition and collective visual perception. They exhibit limited generalization ability to this special kind of art, leading to a dramatic gap of over 20% accuracy compared with their proprietary counterparts. Another critical finding is that model performance is sensitive to the length of the ASCII art, with this sensitivity varying across input modalities. Unfortunately, none of the models could successfully benefit from the simultaneous provision of both modalities, highlighting the need for more flexible modality-fusion approaches. We also introduce approaches for further enhancement and discuss future directions. Resources are available at https://github.com/JiaQiSJTU/VisionInText.
format	Preprint
id	arxiv_https___arxiv_org_abs_2410_01733
institution	arXiv
publishDate	2024
record_format	arxiv
title	ASCIIEval: Benchmarking Models' Visual Perception in Text Strings via ASCII Art
topic	Computation and Language
url	https://arxiv.org/abs/2410.01733

Similar Items