MARC21: :: Library Catalog

Bibliographic Details
Main Authors:	Yiu, Eunice; Qraitem, Maan; Majhi, Anisa Noor; Wong, Charlie; Bai, Yutong; Ginosar, Shiry; Gopnik, Alison; Saenko, Kate
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence; Computation and Language; Machine Learning
Online Access:	https://arxiv.org/abs/2407.17773

_version_	1866908691809173504
author	Yiu, Eunice; Qraitem, Maan; Majhi, Anisa Noor; Wong, Charlie; Bai, Yutong; Ginosar, Shiry; Gopnik, Alison; Saenko, Kate
author_facet	Yiu, Eunice; Qraitem, Maan; Majhi, Anisa Noor; Wong, Charlie; Bai, Yutong; Ginosar, Shiry; Gopnik, Alison; Saenko, Kate
contents	This paper investigates visual analogical reasoning in large multimodal models (LMMs) compared to human adults and children. A "visual analogy" is an abstract rule inferred from one image and applied to another. While benchmarks exist for testing visual reasoning in LMMs, they require advanced skills and omit basic visual analogies that even young children can make. Inspired by developmental psychology, we propose a new benchmark of 4,300 visual transformations of everyday objects to test LMMs on visual analogical reasoning and compare them to children (ages three to five) and to adults. We structure the evaluation into three stages: identifying what changed (e.g., color, number, etc.), how it changed (e.g., added one object), and applying the rule to new scenarios. Our findings show that while GPT-o1, GPT-4V, LLaVA-1.5, and MANTIS identify the "what" effectively, they struggle with quantifying the "how" and extrapolating this rule to new objects. In contrast, children and adults exhibit much stronger analogical reasoning at all three stages. Additionally, the strongest tested model, GPT-o1, performs better in tasks involving simple surface-level visual attributes like color and size, correlating with quicker human adult response times. Conversely, more complex tasks such as number, rotation, and reflection, which necessitate extensive cognitive processing and understanding of extrinsic spatial properties in the physical world, present more significant challenges. Altogether, these findings highlight the limitations of training models on data that primarily consists of 2D images and text.
format	Preprint
id	arxiv_https___arxiv_org_abs_2407_17773
institution	arXiv
publishDate	2024
record_format	arxiv
title	KiVA: Kid-inspired Visual Analogies for Testing Large Multimodal Models
topic	Computer Vision and Pattern Recognition; Artificial Intelligence; Computation and Language; Machine Learning
url	https://arxiv.org/abs/2407.17773
