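Staff View: Enhancing Conceptual Understanding in Multimodal Contrastive Learning through Hard Negative Samples :: Library Catalog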

Bibliographic Details
Main Authors:	Rösch, Philipp J.; Oswald, Norbert; Geierhos, Michaela; Libovický, Jindřich
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition; Computation and Language; Information Retrieval; I.4; I.7
Online Access:	https://arxiv.org/abs/2403.02875

_version_	1866916345575112704
author	Rösch, Philipp J.; Oswald, Norbert; Geierhos, Michaela; Libovický, Jindřich
author_facet	Rösch, Philipp J.; Oswald, Norbert; Geierhos, Michaela; Libovický, Jindřich
contents	Current multimodal models leveraging contrastive learning often face limitations in developing fine-grained conceptual understanding. This is due to random negative samples during pretraining, causing almost exclusively very dissimilar concepts to be compared in the loss function. Consequently, the models struggle with fine-grained semantic differences. To address this problem, we introduce a novel pretraining method incorporating synthetic hard negative text examples. The hard negatives permute terms corresponding to visual concepts, leading to a more fine-grained visual and textual concept alignment. Further, we introduce InpaintCOCO, a new challenging dataset for assessing the fine-grained alignment of colors, objects, and sizes in vision-language models. We created the dataset using generative inpainting from COCO images by changing the visual concepts so that the images no longer match their original captions. Our results show significant improvements in fine-grained concept understanding across a wide range of vision-language datasets, including our InpaintCOCO dataset.
format	Preprint
id	arxiv_https___arxiv_org_abs_2403_02875
institution	arXiv
publishDate	2024
record_format	arxiv
title	Enhancing Conceptual Understanding in Multimodal Contrastive Learning through Hard Negative Samples
topic	Computer Vision and Pattern Recognition; Computation and Language; Information Retrieval; I.4; I.7
url	https://arxiv.org/abs/2403.02875

Similar Items