Staff View: Real Classification by Description: Extending CLIP's Limits of Part Attributes Recognition :: Library Catalog

Bibliographic Details
Main Authors:	Baron, Ethan; Tankel, Idan; Tu, Peter; Ben-Yosef, Guy
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2412.13947

_version_	1866917872526163968
author	Baron, Ethan; Tankel, Idan; Tu, Peter; Ben-Yosef, Guy
author_facet	Baron, Ethan; Tankel, Idan; Tu, Peter; Ben-Yosef, Guy
contents	In this study, we define and tackle zero-shot "real" classification by description, a novel task that evaluates the ability of Vision-Language Models (VLMs) like CLIP to classify objects based solely on descriptive attributes, excluding object class names. This approach highlights the current limitations of VLMs in understanding intricate object descriptions, pushing these models beyond mere object recognition. To facilitate this exploration, we introduce a new challenge and release description data for six popular fine-grained benchmarks, which omit object names to encourage genuine zero-shot learning within the research community. Additionally, we propose a method to enhance CLIP's attribute detection capabilities through targeted training using ImageNet21k's diverse object categories, paired with rich attribute descriptions generated by large language models. Furthermore, we introduce a modified CLIP architecture that leverages multiple resolutions to improve the detection of fine-grained part attributes. Through these efforts, we broaden the understanding of part-attribute recognition in CLIP, improving its performance in fine-grained classification tasks across six popular benchmarks, as well as in the PACO dataset, a widely used benchmark for object-attribute recognition. Code is available at: https://github.com/ethanbar11/grounding_ge_public.
format	Preprint
id	arxiv_https___arxiv_org_abs_2412_13947
institution	arXiv
publishDate	2024
record_format	arxiv
title	Real Classification by Description: Extending CLIP's Limits of Part Attributes Recognition
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2412.13947
