Staff View: GeminiFusion: Efficient Pixel-wise Multimodal Fusion for Vision Transformer

Bibliographic Details
Main Authors:	Jia, Ding; Guo, Jianyuan; Han, Kai; Wu, Han; Zhang, Chao; Xu, Chang; Chen, Xinghao
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2406.01210

_version_	1866917683579060224
author	Jia, Ding; Guo, Jianyuan; Han, Kai; Wu, Han; Zhang, Chao; Xu, Chang; Chen, Xinghao
author_facet	Jia, Ding; Guo, Jianyuan; Han, Kai; Wu, Han; Zhang, Chao; Xu, Chang; Chen, Xinghao
contents	Cross-modal transformers have demonstrated superiority in various vision tasks by effectively integrating different modalities. This paper first critiques prior token exchange methods, which replace less informative tokens with inter-modal features, and demonstrates that exchange-based methods underperform cross-attention mechanisms, while the computational demand of the latter inevitably restricts its use with longer sequences. To surmount the computational challenges, we propose GeminiFusion, a pixel-wise fusion approach that capitalizes on aligned cross-modal representations. GeminiFusion elegantly combines intra-modal and inter-modal attentions, dynamically integrating complementary information across modalities. We employ a layer-adaptive noise to adaptively control their interplay on a per-layer basis, thereby achieving a harmonized fusion process. Notably, GeminiFusion maintains linear complexity with respect to the number of input tokens, ensuring this multimodal framework operates with efficiency comparable to unimodal networks. Comprehensive evaluations across multimodal image-to-image translation, 3D object detection and arbitrary-modal semantic segmentation tasks, including RGB, depth, LiDAR, event data, etc., demonstrate the superior performance of our GeminiFusion against leading-edge techniques. The PyTorch code is available at https://github.com/JiaDingCN/GeminiFusion
format	Preprint
id	arxiv_https___arxiv_org_abs_2406_01210
institution	arXiv
publishDate	2024
record_format	arxiv
title	GeminiFusion: Efficient Pixel-wise Multimodal Fusion for Vision Transformer
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2406.01210