Salvato in:
Dettagli Bibliografici
Autori principali: Liu, Mingxin, Ma, Shuran, Meng, Shibei, Zhao, Xiangyu, Zhang, Zicheng, Zhang, Shaofeng, Zhong, Zhihang, Chen, Peixian, Cao, Haoyu, Sun, Xing, Duan, Haodong, Yang, Xue
Natura: Preprint
Pubblicazione: 2026
Soggetti:
Accesso online:https://arxiv.org/abs/2602.05986
Tags: Aggiungi Tag
Nessun Tag, puoi essere il primo ad aggiungerne!!
Sommario:
  • While generative video models have achieved remarkable visual fidelity, their capacity to internalize and reason over implicit world rules remains a critical yet under-explored frontier. To bridge this gap, we present RISE-Video, a pioneering reasoning-oriented benchmark for Text-Image-to-Video (TI2V) synthesis that shifts the evaluative focus from surface-level aesthetics to deep cognitive reasoning. RISE-Video comprises 467 meticulously human-annotated samples spanning eight rigorous categories, providing a structured testbed for probing model intelligence across diverse dimensions, ranging from commonsense and spatial dynamics to specialized subject domains. Our framework introduces a multi-dimensional evaluation protocol consisting of four metrics: \textit{Reasoning Alignment}, \textit{Temporal Consistency}, \textit{Physical Rationality}, and \textit{Visual Quality}. To further support scalable evaluation, we propose an automated pipeline leveraging Large Multimodal Models (LMMs) to emulate human-centric assessment. Extensive experiments on 11 state-of-the-art TI2V models reveal pervasive deficiencies in simulating complex scenarios under implicit constraints, offering critical insights for the advancement of future world-simulating generative models.