Bibliographic Details
Main Authors: Xia, Zixuan; Wang, Hao; Weng, Pengcheng; Qian, Yanyu; Xu, Yangxin; Dan, William; Wang, Fei
Format: Preprint
Published: 2026
Subjects: Computer Vision and Pattern Recognition; Machine Learning
Online Access: https://arxiv.org/abs/2601.21670
contents Multimodal fusion is often treated as an optimization-balancing problem, where training signals are adjusted to prevent one modality from dominating the others. However, balanced optimization does not fully determine the geometry of intermediate representations. Supervised multimodal models may still learn low-diversity modality-specific embeddings or allow paired cross-modal observations to drift excessively apart, weakening both unimodal robustness and multimodal fusion. We introduce \regName, a lightweight plug-and-play geometric regularization framework for multimodal representation learning. Rather than enforcing rigid cross-modal alignment, \regName follows a bounded-agreement principle: preserve modality-specific diversity while softly constraining only the portion of paired cross-modal drift that exceeds an admissible agreement band. Operationally, \regName combines a dispersion term that mitigates spectral concentration with an agreement-band anchoring term that controls excessive paired drift, requiring no architectural modification or inference-time overhead. Experiments across audio-visual, image-text, and RF-based benchmarks show that \regName consistently improves multimodal performance and often strengthens unimodal representations. These results suggest that explicitly regulating representation geometry is an effective complement to optimization balancing, and provide evidence that geometry-aware regularization can improve multimodal learning across diverse architectures and domains.
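The abstract names two ingredients, a dispersion term against spectral concentration and an agreement-band anchoring term on paired drift, but gives no formulas. The following is only a minimal sketch of how such a regularizer could look, not the authors' implementation. Everything in it is an assumption made for illustration: the function names, the use of normalized spectral entropy of the embedding covariance as the dispersion measure, the squared hinge on paired Euclidean distance, and the `band`, `w_disp`, and `w_anchor` parameters.

```python
import numpy as np


def dispersion_penalty(z, eps=1e-8):
    """Hypothetical dispersion term: 1 minus the normalized spectral entropy.

    Close to 1 when the variance of the batch sits in a single direction
    (spectral concentration), close to 0 when it is spread evenly.
    """
    z = z - z.mean(axis=0, keepdims=True)
    cov = z.T @ z / max(z.shape[0] - 1, 1)
    # eigvalsh is for symmetric matrices; clip tiny negative round-off.
    eig = np.clip(np.linalg.eigvalsh(cov), 0.0, None)
    p = eig / (eig.sum() + eps)
    entropy = -np.sum(p * np.log(p + eps))
    return 1.0 - entropy / np.log(len(p))


def agreement_band_penalty(z_a, z_b, band=0.5):
    """Hypothetical anchoring term: squared hinge on paired drift.

    Pairs whose distance is inside the band contribute nothing; only the
    excess beyond the band is penalized.
    """
    drift = np.linalg.norm(z_a - z_b, axis=1)
    return float(np.mean(np.maximum(drift - band, 0.0) ** 2))


def geometric_regularizer(z_a, z_b, band=0.5, w_disp=1.0, w_anchor=1.0):
    """Weighted sum of both terms, to be added to a task loss (assumed form)."""
    disp = dispersion_penalty(z_a) + dispersion_penalty(z_b)
    anchor = agreement_band_penalty(z_a, z_b, band)
    return w_disp * disp + w_anchor * anchor
```

In this sketch the regularizer touches only the training loss, which is consistent with the abstract's claim of no architectural change and no inference-time overhead. A differentiable framework would be needed to train with it; NumPy is used here only to keep the example self-contained.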
institution arXiv
title Diverse via bounded Agreement: Geometric Regularization for Multimodal Fusion
topic Computer Vision and Pattern Recognition
Machine Learning