Staff View: Diverse via bounded Agreement: Geometric Regularization for Multimodal Fusion :: Library Catalog

Bibliographic Details
Main Authors:	Xia, Zixuan; Wang, Hao; Weng, Pengcheng; Qian, Yanyu; Xu, Yangxin; Dan, William; Wang, Fei
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition; Machine Learning
Online Access:	https://arxiv.org/abs/2601.21670

_version_	1866913157618860032
author	Xia, Zixuan; Wang, Hao; Weng, Pengcheng; Qian, Yanyu; Xu, Yangxin; Dan, William; Wang, Fei
contents	Multimodal fusion is often treated as an optimization-balancing problem, where training signals are adjusted to prevent one modality from dominating the others. However, balanced optimization does not fully determine the geometry of intermediate representations. Supervised multimodal models may still learn low-diversity modality-specific embeddings or allow paired cross-modal observations to drift excessively apart, weakening both unimodal robustness and multimodal fusion. We introduce \regName, a lightweight plug-and-play geometric regularization framework for multimodal representation learning. Rather than enforcing rigid cross-modal alignment, \regName follows a bounded-agreement principle: preserve modality-specific diversity while softly constraining only the portion of paired cross-modal drift that exceeds an admissible agreement band. Operationally, \regName combines a dispersion term that mitigates spectral concentration with an agreement-band anchoring term that controls excessive paired drift, requiring no architectural modification or inference-time overhead. Experiments across audio-visual, image-text, and RF-based benchmarks show that \regName consistently improves multimodal performance and often strengthens unimodal representations. These results suggest that explicitly regulating representation geometry is an effective complement to optimization balancing, and provide evidence that geometry-aware regularization can improve multimodal learning across diverse architectures and domains.
format	Preprint
id	arxiv_https___arxiv_org_abs_2601_21670
institution	arXiv
publishDate	2026
record_format	arxiv
title	Diverse via bounded Agreement: Geometric Regularization for Multimodal Fusion
topic	Computer Vision and Pattern Recognition; Machine Learning
url	https://arxiv.org/abs/2601.21670
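The abstract describes the two regularizers only at a high level. The following is a hypothetical sketch, not the authors' implementation, of how a bounded-agreement objective of this kind could be computed. Every choice in it is an assumption: embeddings are plain lists of floats; "spectral concentration" is measured as ‖C‖_F² / tr(C)², which equals Σλᵢ² / (Σλᵢ)² for the batch covariance C and ranges from 1/d (isotropic spread) to 1 (collapse onto one direction); and agreement-band anchoring is modeled as a squared hinge on the Euclidean distance between paired embeddings beyond a band width `band`. The function names and the weights `lam_disp` and `lam_anchor` are illustrative only.

```python
import math
from typing import List, Sequence

Batch = Sequence[Sequence[float]]


def covariance(batch: Batch) -> List[List[float]]:
    """Population covariance matrix of a batch of d-dimensional embeddings."""
    n, d = len(batch), len(batch[0])
    mean = [sum(row[j] for row in batch) / n for j in range(d)]
    centered = [[row[j] - mean[j] for j in range(d)] for row in batch]
    return [[sum(r[i] * r[j] for r in centered) / n for j in range(d)]
            for i in range(d)]


def spectral_concentration(batch: Batch) -> float:
    """Assumed dispersion measure: sum(lambda^2) / (sum lambda)^2.

    For a symmetric covariance C, sum(lambda^2) = ||C||_F^2 and
    sum(lambda) = tr(C), so no eigendecomposition is needed.
    Lower values mean the variance is spread over more directions.
    """
    c = covariance(batch)
    trace = sum(c[i][i] for i in range(len(c)))
    if trace == 0.0:
        return 1.0  # all embeddings identical: treat as fully collapsed
    frob_sq = sum(v * v for row in c for v in row)
    return frob_sq / trace ** 2


def agreement_band_penalty(za: Batch, zb: Batch, band: float) -> float:
    """Assumed anchoring term: penalize only paired drift that exceeds `band`."""
    total = 0.0
    for a, b in zip(za, zb):
        excess = max(0.0, math.dist(a, b) - band)  # zero inside the band
        total += excess ** 2
    return total / len(za)


def bounded_agreement_loss(za: Batch, zb: Batch, band: float = 0.5,
                           lam_disp: float = 1.0, lam_anchor: float = 1.0) -> float:
    """Hypothetical combined regularizer added to a task loss during training."""
    dispersion = spectral_concentration(za) + spectral_concentration(zb)
    return lam_disp * dispersion + lam_anchor * agreement_band_penalty(za, zb, band)
```

In this sketch, pairs whose drift stays inside the band contribute nothing, so the anchoring term leaves modality-specific variation untouched until it becomes excessive, while the dispersion term independently discourages either modality's embeddings from collapsing onto a few directions.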
