Bibliographic details
Main Authors: Dai, Yifan, Wu, Zhenhua, Zeng, Bohan, Hua, Daili, Liu, Jialing, Li, Bozhou, Wang, Yuran, Tong, Chengzhuo, Liang, Hao, Ma, Xiaochen, Niu, Junbo, Guo, Tianyu, Shi, Yang, Ding, Yue, Ji, Yiyan, Mei, Bingyin, Guan, Yushuo, Zhang, Yuanxing, Wan, Pengfei, Fu, Fangcheng, Zhang, Wentao
Format: Preprint
Published: 2026
Online access: https://arxiv.org/abs/2605.22012
Table of Contents:
  • Joint audio-visual reasoning is essential for omnimodal understanding, yet current multimodal large language models (MLLMs) still struggle when reasoning requires fine-grained evidence from both modalities. A central limitation is that explicit text-based chain-of-thought (CoT) compresses continuous audio-visual signals into discrete tokens, weakening temporal grounding and shifting intermediate reasoning toward language priors. We argue that a unified latent space is a better medium for such reasoning because it preserves dense sensory information while remaining compatible with autoregressive generation. Based on this insight, we propose LatentOmni, a cross-modal reasoning framework that interleaves textual reasoning with audio-visual latent states. LatentOmni introduces feature-level supervision to align latent reasoning states with task-relevant sensory features and uses Omni-Sync Position Embedding (OSPE) to maintain temporal consistency between latent audio and visual states. We further construct LatentOmni-Instruct-35K, a dataset of audio-visual interleaved reasoning trajectories for supervising latent-space reasoning. Comprehensive evaluation across multiple audio-visual reasoning benchmarks demonstrates that LatentOmni achieves the best performance among the evaluated open-source models and consistently outperforms the Explicit Text CoT baseline, supporting latent-space joint reasoning as a promising path toward stronger omnimodal understanding.