Staff View: MOVER: Multimodal Optimal Transport with Volume-based Embedding Regularization :: Library Catalog

Bibliographic Details
Main Authors:	You, Haochen; Liu, Baojing
Format:	Preprint
Published:	2025
Subjects:	Artificial Intelligence
Online Access:	https://arxiv.org/abs/2508.12149

_version_	1866909738558554112
author	You, Haochen; Liu, Baojing
author_facet	You, Haochen; Liu, Baojing
contents	Recent advances in multimodal learning have largely relied on pairwise contrastive objectives to align different modalities, such as text, video, and audio, in a shared embedding space. While effective in bi-modal setups, these approaches struggle to generalize across multiple modalities and often lack semantic structure in high-dimensional spaces. In this paper, we propose MOVER, a novel framework that combines optimal transport-based soft alignment with volume-based geometric regularization to build semantically aligned and structured multimodal representations. By integrating a transport-guided matching mechanism with a geometric volume minimization objective (GAVE), MOVER encourages consistent alignment across all modalities in a modality-agnostic manner. Experiments on text-video-audio retrieval tasks demonstrate that MOVER significantly outperforms prior state-of-the-art methods in both zero-shot and finetuned settings. Additional analysis shows improved generalization to unseen modality combinations and stronger structural consistency in the learned embedding space.
format	Preprint
id	arxiv_https___arxiv_org_abs_2508_12149
institution	arXiv
publishDate	2025
record_format	arxiv
title	MOVER: Multimodal Optimal Transport with Volume-based Embedding Regularization
topic	Artificial Intelligence
url	https://arxiv.org/abs/2508.12149

Similar Items