Bibliographic Details
Main Authors: Chae, Hyunsik, Yoon, Seungwoo, Park, Jaden, Chun, Chloe Yewon, Cho, Yongin, Cai, Mu, Lee, Yong Jae, Ryu, Ernest K.
Format: Preprint
Published: 2025
Subjects:
Online Access: https://arxiv.org/abs/2505.20021
_version_ 1866913859630006272
author Chae, Hyunsik
Yoon, Seungwoo
Park, Jaden
Chun, Chloe Yewon
Cho, Yongin
Cai, Mu
Lee, Yong Jae
Ryu, Ernest K.
contents Recent Vision-Language Models (VLMs) have demonstrated impressive multimodal comprehension and reasoning capabilities, yet they often struggle with trivially simple visual tasks. In this work, we focus on the domain of basic 2D Euclidean geometry and systematically categorize the fundamental, indivisible visual perception skills, which we refer to as atomic visual skills. We then introduce the Atomic Visual Skills Dataset (AVSD) for evaluating VLMs on these skills. Using AVSD, we benchmark state-of-the-art VLMs and find that they struggle with these tasks, even though the tasks are trivial for adult humans. Our findings highlight the need for purpose-built datasets to train and evaluate VLMs on atomic, rather than composite, visual perception tasks.
format Preprint
id arxiv_https___arxiv_org_abs_2505_20021
institution arXiv
publishDate 2025
record_format arxiv
title Decomposing Complex Visual Comprehension into Atomic Visual Skills for Vision Language Models
topic Computer Vision and Pattern Recognition
Artificial Intelligence
url https://arxiv.org/abs/2505.20021