Staff View: Benchmarking Attribute Discrimination in Infant-Scale Vision-Language Models :: Library Catalog

Bibliographic Details
Main Authors:	Batsell, Patrick; Tsutsui, Satoshi; Wen, Bihan
Format:	Preprint
Published:	2025
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2512.18951

_version_	1866916008068907008
author	Batsell, Patrick; Tsutsui, Satoshi; Wen, Bihan
author_facet	Batsell, Patrick; Tsutsui, Satoshi; Wen, Bihan
contents	Infants learn not only object categories but also fine-grained visual attributes such as color, size, and texture from limited experience. Prior infant-scale vision-language models have mainly been evaluated on object recognition, leaving open whether they support within-class attribute discrimination. We introduce a controlled benchmark that varies color, size, and texture across 67 everyday object classes using synthetic rendering to decouple attribute values from object identity. We evaluate infant-trained models (CVCL and an infant-trained DINO baseline) against web-scale and ImageNet models (CLIP, SigLIP, ResNeXt) under two complementary settings: an image-only prototype test and a text-vision test with attribute-object prompts. We find a dissociation between visual and linguistic attribute information: infant-trained models form strong visual representations for size and discriminate texture comparably to other models, but perform poorly on visual color discrimination, and in the text-vision setting they struggle to ground color and show only modest size grounding. In contrast, web-trained vision-language models strongly ground color from text while exhibiting weaker visual size discrimination.
format	Preprint
id	arxiv_https___arxiv_org_abs_2512_18951
institution	arXiv
publishDate	2025
record_format	arxiv
title	Benchmarking Attribute Discrimination in Infant-Scale Vision-Language Models
topic	Machine Learning
url	https://arxiv.org/abs/2512.18951