Bibliographic Details
Main Authors: Sharifdeen, Ashshak, Shamshad, Fahad, Munir, Muhammad Akhtar, Basu, Abhishek, Ismithdeen, Mohamed Insaf, Jeyamohan, Jeyapriyan, Silva, Chathurika Sewwandi, Nandakumar, Karthik, Khan, Muhammad Haris
Format: Preprint
Published: 2026
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2602.19024
_version_ 1866910029171392512
author Sharifdeen, Ashshak
Shamshad, Fahad
Munir, Muhammad Akhtar
Basu, Abhishek
Ismithdeen, Mohamed Insaf
Jeyamohan, Jeyapriyan
Silva, Chathurika Sewwandi
Nandakumar, Karthik
Khan, Muhammad Haris
author_facet Sharifdeen, Ashshak
Shamshad, Fahad
Munir, Muhammad Akhtar
Basu, Abhishek
Ismithdeen, Mohamed Insaf
Jeyamohan, Jeyapriyan
Silva, Chathurika Sewwandi
Nandakumar, Karthik
Khan, Muhammad Haris
contents Prompt tuning of large-scale vision-language models such as CLIP enables efficient task adaptation without updating model weights. However, it often leads to poor confidence calibration and unreliable predictive uncertainty. We address this problem by proposing a calibration framework that enhances predictive reliability while preserving the geometry of the pretrained CLIP embedding space, which is required for robust generalization. Our approach extends the standard cross-entropy loss with two complementary regularizers: (1) a mean-variance margin penalty that stabilizes inter-class logit margins by maximizing their average while minimizing dispersion, mitigating underconfidence and overconfidence spikes; and (2) a text moment-matching loss that aligns the first and second moments of tuned text embeddings with their frozen CLIP counterparts, preserving semantic dispersion crucial for generalization. Through extensive experiments across 7 prompt-tuning methods and 11 diverse datasets, we demonstrate that our approach significantly reduces the Expected Calibration Error (ECE) compared to competitive calibration techniques on both base and novel classes.
format Preprint
id arxiv_https___arxiv_org_abs_2602_19024
institution arXiv
publishDate 2026
record_format arxiv
spellingShingle Towards Calibrating Prompt Tuning of Vision-Language Models
Sharifdeen, Ashshak
Shamshad, Fahad
Munir, Muhammad Akhtar
Basu, Abhishek
Ismithdeen, Mohamed Insaf
Jeyamohan, Jeyapriyan
Silva, Chathurika Sewwandi
Nandakumar, Karthik
Khan, Muhammad Haris
Computer Vision and Pattern Recognition
Prompt tuning of large-scale vision-language models such as CLIP enables efficient task adaptation without updating model weights. However, it often leads to poor confidence calibration and unreliable predictive uncertainty. We address this problem by proposing a calibration framework that enhances predictive reliability while preserving the geometry of the pretrained CLIP embedding space, which is required for robust generalization. Our approach extends the standard cross-entropy loss with two complementary regularizers: (1) a mean-variance margin penalty that stabilizes inter-class logit margins by maximizing their average while minimizing dispersion, mitigating underconfidence and overconfidence spikes; and (2) a text moment-matching loss that aligns the first and second moments of tuned text embeddings with their frozen CLIP counterparts, preserving semantic dispersion crucial for generalization. Through extensive experiments across 7 prompt-tuning methods and 11 diverse datasets, we demonstrate that our approach significantly reduces the Expected Calibration Error (ECE) compared to competitive calibration techniques on both base and novel classes.
title Towards Calibrating Prompt Tuning of Vision-Language Models
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2602.19024
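note The abstract describes a loss made of cross-entropy plus two regularizers but gives no formulas. The sketch below is one possible reading, not the authors' implementation. It assumes the margin is the true-class logit minus the largest other-class logit, that the margin penalty is the negative mean margin plus a weighted variance, and that moment matching compares the mean vector and covariance matrix of the text embeddings. The function names and the weights alpha, beta, and lam are hypothetical.

```python
import math


def softmax_cross_entropy(logits, label):
    """Cross-entropy of one sample, computed stably from raw logits."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def margin_penalty(batch_logits, labels, lam=1.0):
    """Assumed form: -mean(margin) + lam * var(margin).

    The margin of a sample is its true-class logit minus the largest
    logit among the other classes (an assumption, not from the abstract).
    """
    margins = []
    for logits, y in zip(batch_logits, labels):
        best_other = max(z for k, z in enumerate(logits) if k != y)
        margins.append(logits[y] - best_other)
    mean = sum(margins) / len(margins)
    var = sum((m - mean) ** 2 for m in margins) / len(margins)
    return -mean + lam * var


def moments(embeddings):
    """Mean vector and biased covariance matrix of a list of vectors."""
    n, d = len(embeddings), len(embeddings[0])
    mu = [sum(e[j] for e in embeddings) / n for j in range(d)]
    cov = [
        [sum((e[i] - mu[i]) * (e[j] - mu[j]) for e in embeddings) / n
         for j in range(d)]
        for i in range(d)
    ]
    return mu, cov


def moment_matching_loss(tuned, frozen):
    """Assumed form: squared mean gap plus squared Frobenius covariance gap."""
    mu_t, cov_t = moments(tuned)
    mu_f, cov_f = moments(frozen)
    d = len(mu_t)
    first = sum((a - b) ** 2 for a, b in zip(mu_t, mu_f))
    second = sum(
        (cov_t[i][j] - cov_f[i][j]) ** 2 for i in range(d) for j in range(d)
    )
    return first + second


def total_loss(batch_logits, labels, tuned, frozen,
               alpha=0.1, beta=0.1, lam=1.0):
    """Cross-entropy plus the two weighted regularizers (weights hypothetical)."""
    ce = sum(
        softmax_cross_entropy(l, y) for l, y in zip(batch_logits, labels)
    ) / len(labels)
    return (ce
            + alpha * margin_penalty(batch_logits, labels, lam)
            + beta * moment_matching_loss(tuned, frozen))
```

Here tuned and frozen are lists of per-class text embedding vectors of equal dimension. When the tuned embeddings equal the frozen ones, the moment-matching term is zero and only the cross-entropy and margin terms contribute.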