Staff View :: Library Catalog

Bibliographic Details
Main Authors:	Mansourian, Amir M.; Babaei, Amir Mohammad; Kasaei, Shohreh
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2511.09286

_version_	1866917076248035328
author	Mansourian, Amir M.; Babaei, Amir Mohammad; Kasaei, Shohreh
author_facet	Mansourian, Amir M.; Babaei, Amir Mohammad; Kasaei, Shohreh
contents	Multi-teacher knowledge distillation (KD), a more effective technique than traditional single-teacher methods, transfers knowledge from expert teachers to a compact student model using logit or feature matching. However, most existing approaches lack knowledge diversity, as they rely solely on unimodal visual information, overlooking the potential of cross-modal representations. In this work, we explore the use of CLIP's vision-language knowledge as a complementary source of supervision for KD, an area that remains largely underexplored. We propose a simple yet effective framework that fuses the logits and features of a conventional teacher with those from CLIP. By incorporating CLIP's multi-prompt textual guidance, the fused supervision captures both dataset-specific and semantically enriched visual cues. Beyond accuracy, analysis shows that the fused teacher yields more confident and reliable predictions, significantly increasing confident-correct cases while reducing confidently wrong ones. Moreover, fusion with CLIP refines the entire logit distribution, producing semantically meaningful probabilities for non-target classes, thereby improving inter-class consistency and distillation quality. Despite its simplicity, the proposed method, Enriching Knowledge Distillation (RichKD), consistently outperforms most existing baselines across multiple benchmarks and exhibits stronger robustness under distribution shifts and input corruptions.
format	Preprint
id	arxiv_https___arxiv_org_abs_2511_09286
institution	arXiv
publishDate	2025
record_format	arxiv
title	Enriching Knowledge Distillation with Cross-Modal Teacher Fusion
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2511.09286
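
Illustrative note: the contents field describes fusing a conventional teacher's logits with CLIP logits obtained from multiple text prompts, then distilling the fused signal into a student. The record does not give the fusion rule, so the sketch below is only one plausible reading, not the authors' implementation. It assumes a simple weighted average of the two logit vectors, averaging of CLIP logits across prompts, and a standard temperature-scaled KL distillation loss. The names `alpha`, `temperature`, `average_prompt_logits`, `fuse_logits`, and `kd_loss` are hypothetical, and feature fusion is not covered.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of class logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def average_prompt_logits(per_prompt_logits):
    """Average class logits over several prompt templates (one row per prompt).

    Assumption: prompts are combined by a plain mean; the source does not say how.
    """
    count = len(per_prompt_logits)
    return [sum(column) / count for column in zip(*per_prompt_logits)]


def fuse_logits(teacher_logits, clip_logits, alpha=0.5):
    """Assumed fusion rule: convex combination weighted by a hypothetical alpha."""
    return [alpha * t + (1.0 - alpha) * c
            for t, c in zip(teacher_logits, clip_logits)]


def kd_loss(student_logits, target_logits, temperature=4.0):
    """KL(target || student) on softened distributions, scaled by temperature^2."""
    p = softmax(target_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return kl * temperature ** 2


# Toy three-class example with made-up numbers.
teacher = [4.0, 1.0, 0.5]
clip_per_prompt = [[3.0, 2.0, 0.0], [3.5, 1.5, 0.5]]  # two prompt templates
fused = fuse_logits(teacher, average_prompt_logits(clip_per_prompt), alpha=0.5)
loss = kd_loss([2.0, 1.5, 1.0], fused)
```

In this reading, `alpha` controls how much the dataset-specific teacher dominates the CLIP signal; the student would minimize `loss` alongside its usual cross-entropy term.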

Similar Items