Staff View: Decomposing Complex Visual Comprehension into Atomic Visual Skills for Vision Language Models :: Library Catalog

Bibliographic Details
Main Authors:	Chae, Hyunsik; Yoon, Seungwoo; Park, Jaden; Chun, Chloe Yewon; Cho, Yongin; Cai, Mu; Lee, Yong Jae; Ryu, Ernest K.
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2505.20021

_version_	1866913859630006272
author	Chae, Hyunsik; Yoon, Seungwoo; Park, Jaden; Chun, Chloe Yewon; Cho, Yongin; Cai, Mu; Lee, Yong Jae; Ryu, Ernest K.
author_facet	Chae, Hyunsik; Yoon, Seungwoo; Park, Jaden; Chun, Chloe Yewon; Cho, Yongin; Cai, Mu; Lee, Yong Jae; Ryu, Ernest K.
contents	Recent Vision-Language Models (VLMs) have demonstrated impressive multimodal comprehension and reasoning capabilities, yet they often struggle with trivially simple visual tasks. In this work, we focus on the domain of basic 2D Euclidean geometry and systematically categorize the fundamental, indivisible visual perception skills, which we refer to as atomic visual skills. We then introduce the Atomic Visual Skills Dataset (AVSD) for evaluating VLMs on these atomic visual skills. Using AVSD, we benchmark state-of-the-art VLMs and find that they struggle with these tasks, even though the tasks are trivial for adult humans. Our findings highlight the need for purpose-built datasets to train and evaluate VLMs on atomic, rather than composite, visual perception tasks.
format	Preprint
id	arxiv_https___arxiv_org_abs_2505_20021
institution	arXiv
publishDate	2025
record_format	arxiv
title	Decomposing Complex Visual Comprehension into Atomic Visual Skills for Vision Language Models
topic	Computer Vision and Pattern Recognition; Artificial Intelligence
url	https://arxiv.org/abs/2505.20021
