Bibliographic Details
Main Authors: Zhang, Dengjia, Martin, Alexander, Jurayj, William, Murray, Kenton, Van Durme, Benjamin, Kriz, Reno
Format: Preprint
Published: 2026
Subjects: Computer Vision and Pattern Recognition, Machine Learning
Online Access: https://arxiv.org/abs/2604.08701
author Zhang, Dengjia
Martin, Alexander
Jurayj, William
Murray, Kenton
Van Durme, Benjamin
Kriz, Reno
contents We introduce Unified Multimodal Uncertain Inference (UMUI), a multimodal inference task spanning text, audio, and video, in which models must produce calibrated probability estimates of hypotheses conditioned on a premise in any single modality or combination of modalities. While uncertain inference has been explored in text, extensions to other modalities have been limited to single-modality binary entailment judgments, leaving no framework for fine-grained probabilistic reasoning within or across those modalities. To address this, we curate a human-annotated evaluation set with scalar probability judgments across audio, visual, and audiovisual settings, and additionally evaluate on existing text and audio benchmarks. We introduce CLUE (Calibrated Latent Uncertainty Estimation), which combines self-consistent teacher calibration and distribution-based confidence probing to produce calibrated predictions. We demonstrate that our 3B-parameter model matches or outperforms baselines of up to 32B parameters across all modalities.
format Preprint
id arxiv_https___arxiv_org_abs_2604_08701
institution arXiv
publishDate 2026
record_format arxiv
title Unified Multimodal Uncertain Inference
topic Computer Vision and Pattern Recognition
Machine Learning
url https://arxiv.org/abs/2604.08701