Bibliographic Details
Main Authors: Assouel, Rim, Astolfi, Pietro, Bordes, Florian, Drozdzal, Michal, Romero-Soriano, Adriana
Format: Preprint
Published: 2025
Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence
Online Access: https://arxiv.org/abs/2502.14113
author Assouel, Rim
Astolfi, Pietro
Bordes, Florian
Drozdzal, Michal
Romero-Soriano, Adriana
contents Recent advances in vision-language models (VLMs) have been driven by contrastive models such as CLIP, which learn to associate visual information with corresponding text descriptions. However, these models struggle to understand complex compositional scenes involving multiple objects and their spatial relationships. To address these challenges, we propose a novel approach that diverges from commonly used strategies, which rely on the design of hard-negative augmentations. Instead, our work focuses on integrating inductive biases into pre-trained CLIP-like models to improve their compositional understanding without using any additional hard negatives. To that end, we introduce a binding module that connects a scene graph, derived from a text description, with a slot-structured image representation, facilitating a structured similarity assessment between the two modalities. We also leverage relationships as text-conditioned visual constraints, thereby capturing the intricate interactions between objects and their contextual relationships more effectively. Our resulting model not only enhances the performance of CLIP-based models in multi-object compositional understanding but also paves the way towards more accurate and sample-efficient image-text matching of complex scenes.
format Preprint
id arxiv_https___arxiv_org_abs_2502_14113
institution arXiv
publishDate 2025
record_format arxiv
title Object-centric Binding in Contrastive Language-Image Pretraining
topic Computer Vision and Pattern Recognition
Artificial Intelligence
url https://arxiv.org/abs/2502.14113