| Main Authors: | Castro, Santiago; Ziai, Amir; Saluja, Avneesh; Yuan, Zhuoning; Mihalcea, Rada |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | Computer Vision and Pattern Recognition; Computation and Language |
| Online Access: | https://arxiv.org/abs/2402.15021 |
| _version_ | 1866916143545974784 |
|---|---|
| author | Castro, Santiago; Ziai, Amir; Saluja, Avneesh; Yuan, Zhuoning; Mihalcea, Rada |
| author_facet | Castro, Santiago; Ziai, Amir; Saluja, Avneesh; Yuan, Zhuoning; Mihalcea, Rada |
| contents | Recent years have witnessed a significant increase in the performance of Vision and Language tasks. Foundational Vision-Language Models (VLMs), such as CLIP, have been leveraged in multiple settings and demonstrated remarkable performance across several tasks. Such models excel at object-centric recognition yet learn text representations that seem invariant to word order, failing to compose known concepts in novel ways. However, no evidence exists that any VLM, including large-scale single-stream models such as GPT-4V, identifies compositions successfully. In this paper, we introduce a framework to significantly improve the ability of existing models to encode compositional language, with over 10% absolute improvement on compositionality benchmarks, while maintaining or improving the performance on standard object-recognition and retrieval benchmarks. Our code and pre-trained models are publicly available at https://github.com/netflix/clove. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2402_15021 |
| institution | arXiv |
| publishDate | 2024 |
| record_format | arxiv |
| title | CLoVe: Encoding Compositional Language in Contrastive Vision-Language Models |
| topic | Computer Vision and Pattern Recognition; Computation and Language |
| url | https://arxiv.org/abs/2402.15021 |