Staff View: SimMAT: Exploring Transferability from Vision Foundation Models to Any Image Modality

Bibliographic Details
Main Authors:	Lei, Chenyang; Chen, Liyi; Cen, Jun; Chen, Xiao; Lei, Zhen; Heide, Felix; Liu, Ziwei; Chen, Qifeng; Zhang, Zhaoxiang
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2409.08083

_version_	1866912024900927488
author	Lei, Chenyang Chen, Liyi Cen, Jun Chen, Xiao Lei, Zhen Heide, Felix Liu, Ziwei Chen, Qifeng Zhang, Zhaoxiang
author_facet	Lei, Chenyang Chen, Liyi Cen, Jun Chen, Xiao Lei, Zhen Heide, Felix Liu, Ziwei Chen, Qifeng Zhang, Zhaoxiang
contents	Foundation models like ChatGPT and Sora that are trained on a huge scale of data have made a revolutionary social impact. However, it is extremely challenging for sensors in many different fields to collect similar scales of natural images to train strong foundation models. To this end, this work presents a simple and effective framework SimMAT to study an open problem: the transferability from vision foundation models trained on natural RGB images to other image modalities of different physical properties (e.g., polarization). SimMAT consists of a modality-agnostic transfer layer (MAT) and a pretrained foundation model. We apply SimMAT to a representative vision foundation model Segment Anything Model (SAM) to support any evaluated new image modality. Given the absence of relevant benchmarks, we construct a new benchmark to evaluate the transfer learning performance. Our experiments confirm the intriguing potential of transferring vision foundation models in enhancing other sensors' performance. Specifically, SimMAT can improve the segmentation performance (mIoU) from 22.15% to 53.88% on average for evaluated modalities and consistently outperforms other baselines. We hope that SimMAT can raise awareness of cross-modal transfer learning and benefit various fields for better results with vision foundation models.
format	Preprint
id	arxiv_https___arxiv_org_abs_2409_08083
institution	arXiv
publishDate	2024
record_format	arxiv
title	SimMAT: Exploring Transferability from Vision Foundation Models to Any Image Modality
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2409.08083
