Staff View: ArtVLM: Attribute Recognition Through Vision-Based Prefix Language Modeling :: Library Catalog

Bibliographic Details
Main Authors:	Zhu, William Yicheng; Ye, Keren; Ke, Junjie; Yu, Jiahui; Guibas, Leonidas; Milanfar, Peyman; Yang, Feng
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2408.04102

_version_	1866910627964911616
author	Zhu, William Yicheng; Ye, Keren; Ke, Junjie; Yu, Jiahui; Guibas, Leonidas; Milanfar, Peyman; Yang, Feng
author_facet	Zhu, William Yicheng; Ye, Keren; Ke, Junjie; Yu, Jiahui; Guibas, Leonidas; Milanfar, Peyman; Yang, Feng
contents	Recognizing and disentangling visual attributes from objects is foundational to many computer vision applications. While large vision-language representations like CLIP have largely resolved the task of zero-shot object recognition, zero-shot visual attribute recognition remains a challenge because CLIP's contrastively learned vision-language representation cannot effectively capture object-attribute dependencies. In this paper, we target this weakness and propose a sentence generation-based retrieval formulation for attribute recognition that is novel in 1) explicitly modeling the object-attribute relation to be measured and retrieved as a conditional probability graph, which converts the recognition problem into a dependency-sensitive language-modeling problem, and 2) applying a large pretrained Vision-Language Model (VLM) to this reformulation and naturally distilling its knowledge of image-object-attribute relations for attribute recognition. Specifically, for each attribute to be recognized in an image, we measure the visually conditioned probability of generating a short sentence encoding the attribute's relation to objects in the image. Unlike contrastive retrieval, which measures likelihood by globally aligning elements of the sentence to the image, generative retrieval is sensitive to the order and dependency of objects and attributes in the sentence. We demonstrate through experiments that generative retrieval consistently outperforms contrastive retrieval on two visual reasoning datasets: Visual Attributes in the Wild (VAW) and our newly proposed Visual Genome Attribute Ranking (VGARank).
format	Preprint
id	arxiv_https___arxiv_org_abs_2408_04102
institution	arXiv
publishDate	2024
record_format	arxiv
title	ArtVLM: Attribute Recognition Through Vision-Based Prefix Language Modeling
topic	Computer Vision and Pattern Recognition; Artificial Intelligence
url	https://arxiv.org/abs/2408.04102
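
The generative retrieval idea summarized in the contents field can be illustrated with a minimal sketch. It is not the authors' implementation: the sentence template "<object> is <attribute>", the function names, and the toy probability table standing in for a pretrained VLM decoder are all assumptions made for illustration. Each candidate attribute is scored by the log-probability of generating its sentence token by token, so the score depends on the order of, and dependency between, object and attribute.

```python
import math
from typing import Callable, List, Sequence, Tuple

# (image, prefix tokens, next token) -> probability of the next token.
NextTokenProb = Callable[[object, Sequence[str], str], float]


def sentence_log_prob(image: object, tokens: Sequence[str],
                      next_token_prob: NextTokenProb) -> float:
    """Sum of log p(token_t | image, tokens before t): an order-sensitive score."""
    total = 0.0
    for t, token in enumerate(tokens):
        total += math.log(next_token_prob(image, tokens[:t], token))
    return total


def rank_attributes(image: object, obj: str, attributes: Sequence[str],
                    next_token_prob: NextTokenProb) -> List[Tuple[str, float]]:
    """Rank attributes by the score of the assumed sentence '<obj> is <attribute>'."""
    scored = [
        (attr, sentence_log_prob(image, [obj, "is", attr], next_token_prob))
        for attr in attributes
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def toy_next_token_prob(image: object, prefix: Sequence[str], token: str) -> float:
    """Hypothetical stand-in for a VLM decoder; the image argument is ignored."""
    table = {
        (): {"dog": 0.6, "ball": 0.4},
        ("dog",): {"is": 1.0},
        ("ball",): {"is": 1.0},
        ("dog", "is"): {"brown": 0.7, "round": 0.2, "red": 0.1},
        ("ball", "is"): {"red": 0.6, "round": 0.3, "brown": 0.1},
    }
    # Unseen continuations get a small floor so the logarithm stays finite.
    return table.get(tuple(prefix), {}).get(token, 1e-6)


candidates = ["brown", "red", "round"]
dog_ranking = rank_attributes(None, "dog", candidates, toy_next_token_prob)
ball_ranking = rank_attributes(None, "ball", candidates, toy_next_token_prob)
```

In this toy table the same candidate list is ranked differently depending on the object that precedes it, which is the dependency a sentence-level generative score can express. A real system would replace `toy_next_token_prob` with token probabilities from an image-conditioned language model.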
