Staff View: Object-centric Binding in Contrastive Language-Image Pretraining :: Library Catalog

Bibliographic Details
Main Authors:	Assouel, Rim; Astolfi, Pietro; Bordes, Florian; Drozdzal, Michal; Romero-Soriano, Adriana
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2502.14113

_version_	1866910835924795392
author	Assouel, Rim Astolfi, Pietro Bordes, Florian Drozdzal, Michal Romero-Soriano, Adriana
author_facet	Assouel, Rim Astolfi, Pietro Bordes, Florian Drozdzal, Michal Romero-Soriano, Adriana
contents	Recent advances in vision language models (VLM) have been driven by contrastive models such as CLIP, which learn to associate visual information with their corresponding text descriptions. However, these models have limitations in understanding complex compositional scenes involving multiple objects and their spatial relationships. To address these challenges, we propose a novel approach that diverges from commonly used strategies, which rely on the design of hard-negative augmentations. Instead, our work focuses on integrating inductive biases into pre-trained CLIP-like models to improve their compositional understanding without using any additional hard-negatives. To that end, we introduce a binding module that connects a scene graph, derived from a text description, with a slot-structured image representation, facilitating a structured similarity assessment between the two modalities. We also leverage relationships as text-conditioned visual constraints, thereby capturing the intricate interactions between objects and their contextual relationships more effectively. Our resulting model not only enhances the performance of CLIP-based models in multi-object compositional understanding but also paves the way towards more accurate and sample-efficient image-text matching of complex scenes.
format	Preprint
id	arxiv_https___arxiv_org_abs_2502_14113
institution	arXiv
publishDate	2025
record_format	arxiv
title	Object-centric Binding in Contrastive Language-Image Pretraining
topic	Computer Vision and Pattern Recognition Artificial Intelligence
url	https://arxiv.org/abs/2502.14113