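Staff View: Clip4Retrofit: Enabling Real-Time Image Labeling on Edge Devices via Cross-Architecture CLIP Distillation :: Library Catalog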

Bibliographic Details
Main Authors:	Zhong, Li; Ghazal, Ahmed; Wan, Jun-Jun; Zilly, Frederik; Mackens, Patrick; Vollrath, Joachim E.; Coseriu, Bogdan Sorin
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2505.18039

_version_	1866915301565661184
author	Zhong, Li; Ghazal, Ahmed; Wan, Jun-Jun; Zilly, Frederik; Mackens, Patrick; Vollrath, Joachim E.; Coseriu, Bogdan Sorin
author_facet	Zhong, Li; Ghazal, Ahmed; Wan, Jun-Jun; Zilly, Frederik; Mackens, Patrick; Vollrath, Joachim E.; Coseriu, Bogdan Sorin
contents	Foundation models like CLIP (Contrastive Language-Image Pretraining) have revolutionized vision-language tasks by enabling zero-shot and few-shot learning through cross-modal alignment. However, their computational complexity and large memory footprint make them unsuitable for deployment on resource-constrained edge devices, such as in-car cameras used for image collection and real-time processing. To address this challenge, we propose Clip4Retrofit, an efficient model distillation framework that enables real-time image labeling on edge devices. The framework is deployed on the Retrofit camera, a cost-effective edge device retrofitted into thousands of vehicles, despite strict limitations on compute performance and memory. Our approach distills the knowledge of the CLIP model into a lightweight student model, combining EfficientNet-B3 with multi-layer perceptron (MLP) projection heads to preserve cross-modal alignment while significantly reducing computational requirements. We demonstrate that our distilled model achieves a balance between efficiency and performance, making it ideal for deployment in real-world scenarios. Experimental results show that Clip4Retrofit can perform real-time image labeling and object identification on edge devices with limited resources, offering a practical solution for applications such as autonomous driving and retrofitting existing systems. This work bridges the gap between state-of-the-art vision-language models and their deployment in resource-constrained environments, paving the way for broader adoption of foundation models in edge computing.
format	Preprint
id	arxiv_https___arxiv_org_abs_2505_18039
institution	arXiv
publishDate	2025
record_format	arxiv
title	Clip4Retrofit: Enabling Real-Time Image Labeling on Edge Devices via Cross-Architecture CLIP Distillation
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2505.18039

Similar Items