Bibliographic Details
Main Authors: Hashmani, Raheem Karim; Merz, Garrett W.; Qu, Helen; Pettee, Mariel; Cranmer, Kyle
Format: Preprint
Published: 2025
Subjects: Machine Learning
Online Access: https://arxiv.org/abs/2510.21686
_version_ 1866912925026877440
author Hashmani, Raheem Karim
Merz, Garrett W.
Qu, Helen
Pettee, Mariel
Cranmer, Kyle
contents We introduce a framework for generating highly multimodal datasets with explicitly calculable mutual information (MI) between modalities. This enables the construction of benchmark datasets that serve as a testbed for systematic studies of MI estimators and multimodal self-supervised learning (SSL) techniques. Our framework constructs realistic datasets with known MI using a flow-based generative model and a structured causal framework for generating correlated latent variables. We benchmark a suite of MI estimators on datasets with varying ground-truth MI values and verify that regression performance improves as the MI between the input modalities and the target value increases. Finally, we describe how our framework can be applied to contexts including multi-detector astrophysics and SSL studies in the highly multimodal regime.
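The idea of a dataset with known MI can be illustrated with a minimal sketch. It is not the authors' implementation and rests on two standard facts: two jointly Gaussian variables with correlation rho have MI equal to -0.5 * log(1 - rho^2) nats, and MI is unchanged when each variable is passed through its own invertible map. The function names, the choice of rho, and the toy maps below are all hypothetical; the toy maps merely stand in for a trained flow-based model.

```python
import numpy as np


def gaussian_mi(rho: float) -> float:
    """Closed-form MI (in nats) between two jointly Gaussian scalars with correlation rho."""
    return -0.5 * np.log(1.0 - rho**2)


def sample_correlated_latents(rho: float, n: int, seed: int = 0):
    """Draw n pairs of unit-variance Gaussian latents with correlation rho."""
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, rho], [rho, 1.0]])
    z = rng.multivariate_normal(np.zeros(2), cov, size=n)
    return z[:, 0], z[:, 1]


def toy_invertible_map(z: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for a flow: strictly increasing, hence invertible."""
    return z + 0.5 * np.tanh(z)


rho = 0.8  # assumed correlation, chosen only for illustration
z1, z2 = sample_correlated_latents(rho, n=50_000)

# Each "modality" is an invertible transform of its own latent, so
# MI(x1; x2) equals MI(z1; z2), which is known in closed form.
x1 = toy_invertible_map(z1)
x2 = np.sinh(z2)

true_mi = gaussian_mi(rho)

# Sanity check on the latents: plug the empirical correlation into the formula.
rho_hat = np.corrcoef(z1, z2)[0, 1]
plugin_mi = gaussian_mi(rho_hat)

print(f"ground-truth MI: {true_mi:.4f} nats")
print(f"plug-in MI from sampled latents: {plugin_mi:.4f} nats")
```

Varying rho in this sketch changes the ground-truth MI, so an estimator applied to (x1, x2) can be compared against true_mi at each setting.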
format Preprint
id arxiv_https___arxiv_org_abs_2510_21686
institution arXiv
publishDate 2025
record_format arxiv
title Multimodal Datasets with Controllable Mutual Information
topic Machine Learning
url https://arxiv.org/abs/2510.21686