Staff View: Unified Multimodal Uncertain Inference :: Library Catalog

Bibliographic Details
Main Authors:	Zhang, Dengjia; Martin, Alexander; Jurayj, William; Murray, Kenton; Van Durme, Benjamin; Kriz, Reno
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition; Machine Learning
Online Access:	https://arxiv.org/abs/2604.08701

_version_	1866915934507106304
author	Zhang, Dengjia; Martin, Alexander; Jurayj, William; Murray, Kenton; Van Durme, Benjamin; Kriz, Reno
author_facet	Zhang, Dengjia; Martin, Alexander; Jurayj, William; Murray, Kenton; Van Durme, Benjamin; Kriz, Reno
contents	We introduce Unified Multimodal Uncertain Inference (UMUI), a multimodal inference task spanning text, audio, and video, where models must produce calibrated probability estimates of hypotheses conditioned on a premise in any modality or combination. While uncertain inference has been explored in text, extension to other modalities has been limited to single-modality binary entailment judgments, leaving no framework for fine-grained probabilistic reasoning in or across other modalities. To address this, we curate a human-annotated evaluation set with scalar probability judgments across audio, visual, and audiovisual settings, and additionally evaluate on existing text and audio benchmarks. We introduce CLUE (Calibrated Latent Uncertainty Estimation), which combines self-consistent teacher calibration and distribution-based confidence probing to produce calibrated predictions. We demonstrate that our 3B-parameter model achieves equivalent or stronger performance than baselines up to 32B parameters across all modalities.
format	Preprint
id	arxiv_https___arxiv_org_abs_2604_08701
institution	arXiv
publishDate	2026
record_format	arxiv
title	Unified Multimodal Uncertain Inference
topic	Computer Vision and Pattern Recognition; Machine Learning
url	https://arxiv.org/abs/2604.08701
