Staff View: Reevaluating the Intra-Modal Misalignment Hypothesis in CLIP :: Library Catalog

Bibliographic Details
Main Authors:	Herzog, Jonas; Wang, Yue
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2603.16100

_version_	1866913158907559936
author	Herzog, Jonas; Wang, Yue
author_facet	Herzog, Jonas; Wang, Yue
contents	Recent research suggested that the embeddings produced by CLIP-like contrastive language-image training are suboptimal for image-only tasks. The main theory is that the inter-modal (language-image) alignment loss ignores intra-modal (image-image) alignment, leading to poorly calibrated distances between images. In this study, we question this intra-modal misalignment hypothesis. We reexamine its foundational theoretical argument, the indicators used to support it, and the performance metrics affected. For the theoretical argument, we demonstrate that there are no such supposed degrees of freedom for image embedding distances. For the empirical measures, our findings reveal they yield similar results for language-image trained models (CLIP, SigLIP) and image-image trained models (DINO, SigLIP2). This indicates the observed phenomena do not stem from a misalignment specific to the former. Experiments on the commonly studied intra-modal tasks retrieval and few-shot classification confirm that addressing task ambiguity, not supposed misalignment, is key for best results.
format	Preprint
id	arxiv_https___arxiv_org_abs_2603_16100
institution	arXiv
publishDate	2026
record_format	arxiv
title	Reevaluating the Intra-Modal Misalignment Hypothesis in CLIP
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2603.16100

Similar Items