Library Catalog

Bibliographic Details
Main Author:	Deng, Hokin
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2512.05969

Similar Items

Reasoning via Video: The First Evaluation of Video Models' Reasoning Abilities through Maze-Solving Tasks
by: Yang, Cheng, et al.
Published: (2025)

Large Vision Models Can Solve Mental Rotation Problems
by: Mason, Sebastian Ray, et al.
Published: (2025)

Egocentric Bias in Vision-Language Models
by: Wang, Maijunxian, et al.
Published: (2026)

Demystifying Video Reasoning
by: Wang, Ruisi, et al.
Published: (2026)

What Makes a Maze Look Like a Maze?
by: Hsu, Joy, et al.
Published: (2024)

Core Knowledge Deficits in Multi-Modal Language Models
by: Li, Yijiang, et al.
Published: (2024)

Identifying and Solving Conditional Image Leakage in Image-to-Video Diffusion Model
by: Zhao, Min, et al.
Published: (2024)

Deep Learning Methods for Abstract Visual Reasoning: A Survey on Raven's Progressive Matrices
by: Małkiński, Mikołaj, et al.
Published: (2022)

Video Models Reason Early: Exploiting Plan Commitment for Maze Solving
by: Newman, Kaleb, et al.
Published: (2026)

Probing Perceptual Constancy in Large Vision-Language Models
by: Sun, Haoran, et al.
Published: (2025)

An Ordinary Differential Equation Sampler with Stochastic Start for Diffusion Bridge Models
by: Wang, Yuang, et al.
Published: (2024)

The Labyrinth of Links: Navigating the Associative Maze of Multi-modal LLMs
by: Li, Hong, et al.
Published: (2024)

Solving Video Inverse Problems Using Image Diffusion Models
by: Kwon, Taesung, et al.
Published: (2024)

VideoPDE: Unified Generative PDE Solving via Video Inpainting Diffusion Models
by: Li, Edward, et al.
Published: (2025)

Any-to-Bokeh: Arbitrary-Subject Video Refocusing with Video Diffusion Model
by: Yang, Yang, et al.
Published: (2025)

SciVideoBench: Benchmarking Scientific Video Reasoning in Large Multimodal Models
by: Deng, Andong, et al.
Published: (2025)

Warped Diffusion: Solving Video Inverse Problems with Image Diffusion Models
by: Daras, Giannis, et al.
Published: (2024)

Video-CCAM: Enhancing Video-Language Understanding with Causal Cross-Attention Masks for Short and Long Videos
by: Fei, Jiajun, et al.
Published: (2024)

PhyGround: Benchmarking Physical Reasoning in Generative World Models
by: Lin, Juyi, et al.
Published: (2026)

Bias Detection and Rotation-Robustness Mitigation in Vision-Language Models and Generative Image Models
by: Mithila, Tarannum
Published: (2026)

From Narrow to Panoramic Vision: Attention-Guided Cold-Start Reshapes Multimodal Reasoning
by: Luo, Ruilin, et al.
Published: (2026)

Rotation-Aligned Key Channel Pruning for Efficient Vision-Language Model Inference
by: Kang, Beomseok, et al.
Published: (2026)

Semi-Supervised Coupled Thin-Plate Spline Model for Rotation Correction and Beyond
by: Nie, Lang, et al.
Published: (2024)

Towards Robust Probabilistic Modeling on SO(3) via Rotation Laplace Distribution
by: Yin, Yingda, et al.
Published: (2023)

Rethinking Video Generation Model for the Embodied World
by: Deng, Yufan, et al.
Published: (2026)

Beyond Cropping and Rotation: Automated Evolution of Powerful Task-Specific Augmentations with Generative Models
by: Goldfeder, Judah, et al.
Published: (2026)

Video Occupancy Models
by: Tomar, Manan, et al.
Published: (2024)

Learning Image Priors through Patch-based Diffusion Models for Solving Inverse Problems
by: Hu, Jason, et al.
Published: (2024)

GVD: Guiding Video Diffusion Model for Scalable Video Distillation
by: Li, Kunyang, et al.
Published: (2025)

DeVAn: Dense Video Annotation for Video-Language Models
by: Liu, Tingkai, et al.
Published: (2023)

An Organism Starts with a Single Pix-Cell: A Neural Cellular Diffusion for High-Resolution Image Synthesis
by: Elbatel, Marawan, et al.
Published: (2024)

PuzzleBench: A Fully Dynamic Evaluation Framework for Large Multimodal Models on Puzzle Solving
by: Zhang, Zeyu, et al.
Published: (2025)

VG4D: Vision-Language Model Goes 4D Video Recognition
by: Deng, Zhichao, et al.
Published: (2024)

Relaxed Rotational Equivariance via $G$-Biases in Vision
by: Wu, Zhiqiang, et al.
Published: (2024)

MindCube: Spatial Mental Modeling from Limited Views
by: Wang, Qineng, et al.
Published: (2025)

Can Vision-Language Models Solve Visual Math Equations?
by: Choudhury, Monjoy Narayan, et al.
Published: (2025)

Latent Intuitive Physics: Learning to Transfer Hidden Physics from A 3D Video
by: Zhu, Xiangming, et al.
Published: (2024)

Other Vehicle Trajectories Are Also Needed: A Driving World Model Unifies Ego-Other Vehicle Trajectories in Video Latent Space
by: Zhu, Jian, et al.
Published: (2025)

ComRoPE: Scalable and Robust Rotary Position Embedding Parameterized by Trainable Commuting Angle Matrices
by: Yu, Hao, et al.
Published: (2025)

VideoRewardBench: Comprehensive Evaluation of Multimodal Reward Models for Video Understanding
by: Zhang, Zhihong, et al.
Published: (2025)