| Main Authors: | Batsell, Patrick; Tsutsui, Satoshi; Wen, Bihan |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | Machine Learning |
| Online Access: | https://arxiv.org/abs/2512.18951 |
| _version_ | 1866916008068907008 |
|---|---|
| author | Batsell, Patrick; Tsutsui, Satoshi; Wen, Bihan |
| author_facet | Batsell, Patrick; Tsutsui, Satoshi; Wen, Bihan |
| contents | Infants learn not only object categories but also fine-grained visual attributes such as color, size, and texture from limited experience. Prior infant-scale vision-language models have mainly been evaluated on object recognition, leaving open whether they support within-class attribute discrimination. We introduce a controlled benchmark that varies color, size, and texture across 67 everyday object classes using synthetic rendering to decouple attribute values from object identity. We evaluate infant-trained models (CVCL and an infant-trained DINO baseline) against web-scale and ImageNet models (CLIP, SigLIP, ResNeXt) under two complementary settings: an image-only prototype test and a text-vision test with attribute-object prompts. We find a dissociation between visual and linguistic attribute information: infant-trained models form strong visual representations for size and discriminate texture comparably to other models, but perform poorly on visual color discrimination, and in the text-vision setting they struggle to ground color and show only modest size grounding. In contrast, web-trained vision-language models strongly ground color from text while exhibiting weaker visual size discrimination. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2512_18951 |
| institution | arXiv |
| publishDate | 2025 |
| record_format | arxiv |
| title | Benchmarking Attribute Discrimination in Infant-Scale Vision-Language Models |
| topic | Machine Learning |
| url | https://arxiv.org/abs/2512.18951 |