| Main Authors: | Zhang, Tong; Shen, Shu; Chen, C. L. Philip |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | Computer Vision and Pattern Recognition |
| Online Access: | https://arxiv.org/abs/2502.19674 |
| _version_ | 1866912733257007104 |
|---|---|
| author | Zhang, Tong; Shen, Shu; Chen, C. L. Philip |
| author_facet | Zhang, Tong; Shen, Shu; Chen, C. L. Philip |
| contents | Multimodal learning enhances the performance of various machine learning tasks by leveraging complementary information across different modalities. However, existing methods often learn multimodal representations that retain substantial inter-class confusion, making it difficult to achieve high-confidence predictions, particularly in real-world scenarios with low-quality or noisy data. To address this challenge, we propose Multi-Level Adaptive DeConfusion (MLAD), which eliminates inter-class confusion in multimodal data at both global and sample levels, significantly enhancing the classification reliability of multimodal models. Specifically, MLAD first learns class-wise latent distributions with global-level confusion removed via dynamic-exit modality encoders that adapt to the varying discrimination difficulty of each class and a cross-class residual reconstruction mechanism. Subsequently, MLAD further removes sample-specific confusion through sample-adaptive cross-modality rectification guided by confusion-free modality priors. These priors are constructed from low-confusion modality features, identified by evaluating feature confusion using the learned class-wise latent distributions and selecting those with low confusion via a Gaussian mixture model. Experiments demonstrate that MLAD outperforms state-of-the-art methods across multiple benchmarks and exhibits superior reliability. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2502_19674 |
| institution | arXiv |
| publishDate | 2025 |
| record_format | arxiv |
| title | Reliable Multimodal Learning Via Multi-Level Adaptive DeConfusion |
| topic | Computer Vision and Pattern Recognition |
| url | https://arxiv.org/abs/2502.19674 |