Bibliographic Details
Main Authors: Castro, Santiago, Ziai, Amir, Saluja, Avneesh, Yuan, Zhuoning, Mihalcea, Rada
Format: Preprint
Published: 2024
Subjects: Computer Vision and Pattern Recognition; Computation and Language
Online Access: https://arxiv.org/abs/2402.15021
author Castro, Santiago
Ziai, Amir
Saluja, Avneesh
Yuan, Zhuoning
Mihalcea, Rada
contents Recent years have witnessed a significant increase in the performance of Vision and Language tasks. Foundational Vision-Language Models (VLMs), such as CLIP, have been leveraged in multiple settings and demonstrated remarkable performance across several tasks. Such models excel at object-centric recognition yet learn text representations that seem invariant to word order, failing to compose known concepts in novel ways. However, no evidence exists that any VLM, including large-scale single-stream models such as GPT-4V, identifies compositions successfully. In this paper, we introduce a framework to significantly improve the ability of existing models to encode compositional language, with over 10% absolute improvement on compositionality benchmarks, while maintaining or improving the performance on standard object-recognition and retrieval benchmarks. Our code and pre-trained models are publicly available at https://github.com/netflix/clove.
format Preprint
id arxiv_https___arxiv_org_abs_2402_15021
institution arXiv
publishDate 2024
record_format arxiv
title CLoVe: Encoding Compositional Language in Contrastive Vision-Language Models
topic Computer Vision and Pattern Recognition
Computation and Language
url https://arxiv.org/abs/2402.15021