Bibliographic Details
Main Authors: Zhang, Tong; Shen, Shu; Chen, C. L. Philip
Format: Preprint
Published: 2025
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2502.19674
author Zhang, Tong
Shen, Shu
Chen, C. L. Philip
contents Multimodal learning enhances the performance of various machine learning tasks by leveraging complementary information across different modalities. However, existing methods often learn multimodal representations that retain substantial inter-class confusion, making it difficult to achieve high-confidence predictions, particularly in real-world scenarios with low-quality or noisy data. To address this challenge, we propose Multi-Level Adaptive DeConfusion (MLAD), which eliminates inter-class confusion in multimodal data at both global and sample levels, significantly enhancing the classification reliability of multimodal models. Specifically, MLAD first learns class-wise latent distributions with global-level confusion removed via dynamic-exit modality encoders that adapt to the varying discrimination difficulty of each class and a cross-class residual reconstruction mechanism. Subsequently, MLAD further removes sample-specific confusion through sample-adaptive cross-modality rectification guided by confusion-free modality priors. These priors are constructed from low-confusion modality features, identified by evaluating feature confusion using the learned class-wise latent distributions and selecting those with low confusion via a Gaussian mixture model. Experiments demonstrate that MLAD outperforms state-of-the-art methods across multiple benchmarks and exhibits superior reliability.
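The selection step mentioned in the abstract (picking low-confusion features via a Gaussian mixture model) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how confusion is scored or how the mixture is configured, so the per-sample `scores`, the two-component 1-D mixture, the 0.5 posterior threshold, and the function names `fit_two_gaussians` and `select_low_confusion` are all assumptions made for illustration.

```python
import math


def fit_two_gaussians(scores, iters=50):
    """Fit a 1-D, two-component Gaussian mixture to `scores` with plain EM.

    Hypothetical helper: initialisation and iteration count are arbitrary choices.
    """
    lo, hi = min(scores), max(scores)
    span = (hi - lo) or 1.0
    means = [lo, hi]                      # start one component at each extreme
    variances = [(span / 4) ** 2] * 2
    weights = [0.5, 0.5]
    resp = []
    for _ in range(iters):
        # E-step: posterior probability of each component for every score.
        resp = []
        for s in scores:
            dens = [
                w * math.exp(-((s - m) ** 2) / (2 * v)) / math.sqrt(2 * math.pi * v)
                for w, m, v in zip(weights, means, variances)
            ]
            total = sum(dens) or 1e-300   # guard against underflow to zero
            resp.append([d / total for d in dens])
        # M-step: re-estimate weight, mean and variance of each component.
        for k in range(2):
            nk = sum(r[k] for r in resp) or 1e-12
            means[k] = sum(r[k] * s for r, s in zip(resp, scores)) / nk
            var = sum(r[k] * (s - means[k]) ** 2 for r, s in zip(resp, scores)) / nk
            variances[k] = max(var, 1e-6)  # floor keeps the density finite
            weights[k] = nk / len(scores)
    return weights, means, variances, resp


def select_low_confusion(scores, threshold=0.5):
    """Return indices whose posterior for the lower-mean component exceeds `threshold`.

    Assumes a lower score means less inter-class confusion (an assumption,
    not something stated in the abstract).
    """
    _, means, _, resp = fit_two_gaussians(scores)
    low = 0 if means[0] <= means[1] else 1
    return [i for i, r in enumerate(resp) if r[low] > threshold]


if __name__ == "__main__":
    # Synthetic scores: 100 low-confusion samples followed by 50 high-confusion ones.
    demo = [0.1 + 0.001 * i for i in range(100)] + [0.8 + 0.001 * i for i in range(50)]
    kept = select_low_confusion(demo)
    print(len(kept), "samples kept as low-confusion")
```

In this sketch, the retained indices would mark the features used to build the modality priors; how the record's paper aggregates them is not covered by the abstract.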
format Preprint
id arxiv_https___arxiv_org_abs_2502_19674
institution arXiv
publishDate 2025
record_format arxiv
title Reliable Multimodal Learning Via Multi-Level Adaptive DeConfusion
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2502.19674