:: Library Catalog

Bibliographic Details
Main Authors:	He, Yangfan; Boo, Changgyu; Yoon, Jaehong
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2603.10652

Similar Items

CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion
by: Yu, Shoubin, et al.
Published: (2024)

Video-Skill-CoT: Skill-based Chain-of-Thoughts for Domain-Adaptive Video Reasoning
by: Lee, Daeun, et al.
Published: (2025)

RACCooN: A Versatile Instructional Video Editing Framework with Auto-Generated Narratives
by: Yoon, Jaehong, et al.
Published: (2024)

Video-RTS: Rethinking Reinforcement Learning and Test-Time Scaling for Efficient and Enhanced Video Reasoning
by: Wang, Ziyang, et al.
Published: (2025)

VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos
by: Wang, Ziyang, et al.
Published: (2024)

Self-Correcting Text-to-Video Generation with Misalignment Detection and Localized Refinement
by: Lee, Daeun, et al.
Published: (2024)

WorldMM: Dynamic Multimodal Memory Agent for Long Video Reasoning
by: Yeo, Woongyeong, et al.
Published: (2025)

EgoMemReason: A Memory-Driven Reasoning Benchmark for Long-Horizon Egocentric Video Understanding
by: Wang, Ziyang, et al.
Published: (2026)

MEXA: Towards General Multimodal Reasoning with Dynamic Multi-Expert Aggregation
by: Yu, Shoubin, et al.
Published: (2025)

EPiC: Efficient Video Camera Control Learning with Precise Anchor-Video Guidance
by: Wang, Zun, et al.
Published: (2025)

AnchorWeave: World-Consistent Video Generation with Retrieved Local Spatial Memories
by: Wang, Zun, et al.
Published: (2026)

DreamRunner: Fine-Grained Compositional Story-to-Video Generation with Retrieval-Augmented Motion Adaptation
by: Wang, Zun, et al.
Published: (2024)

Frame Guidance: Training-Free Guidance for Frame-Level Control in Video Diffusion Models
by: Jang, Sangwon, et al.
Published: (2025)

Continual Learning: Forget-free Winning Subnetworks for Video Representations
by: Kang, Haeyong, et al.
Published: (2023)

When and How Much to Imagine: Adaptive Test-Time Scaling with World Models for Visual Spatial Reasoning
by: Yu, Shoubin, et al.
Published: (2026)

SAFREE: Training-Free and Adaptive Guard for Safe Text-to-Image And Video Generation
by: Yoon, Jaehong, et al.
Published: (2024)

Training-free Guidance in Text-to-Video Generation via Multimodal Planning and Structured Noise Initialization
by: Li, Jialu, et al.
Published: (2025)

ECoFLaP: Efficient Coarse-to-Fine Layer-Wise Pruning for Vision-Language Models
by: Sung, Yi-Lin, et al.
Published: (2023)

Progressive Fourier Neural Representation for Sequential Video Compilation
by: Kang, Haeyong, et al.
Published: (2023)

SurGo-R1: Benchmarking and Modeling Contextual Reasoning for Operative Zone in Surgical Video
by: Qin, Guanyi, et al.
Published: (2026)

Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark
by: Guo, Ziyu, et al.
Published: (2025)

DART: Leveraging Multi-Agent Disagreement for Tool Recruitment in Multimodal Reasoning
by: Sivakumaran, Nithin, et al.
Published: (2025)

Planning with Sketch-Guided Verification for Physics-Aware Video Generation
by: Huang, Yidong, et al.
Published: (2025)

PhyMotion: Structured 3D Motion Reward for Physics-Grounded Human Video Generation
by: Huang, Yidong, et al.
Published: (2026)

Hierarchy-Aware Multimodal Unlearning for Medical AI
by: Wu, Fengli, et al.
Published: (2025)

DDPM-MoCo: Advancing Industrial Surface Defect Generation and Detection with Generative and Contrastive Learning
by: He, Yangfan, et al.
Published: (2024)

ReGraP-LLaVA: Reasoning enabled Graph-based Personalized Large Language and Vision Assistant
by: Xiang, Yifan, et al.
Published: (2025)

Outside Knowledge Conversational Video (OKCV) Dataset -- Dialoguing over Videos
by: Reichman, Benjamin, et al.
Published: (2025)

ViGoR-Bench: How Far Are Visual Generative Models From Zero-Shot Visual Reasoners?
by: Han, Haonan, et al.
Published: (2026)

Mementos: A Comprehensive Benchmark for Multimodal Large Language Model Reasoning over Image Sequences
by: Wang, Xiyao, et al.
Published: (2024)

Flatten: Video Action Recognition is an Image Classification task
by: Chen, Junlin, et al.
Published: (2024)

Free-Mask: A Novel Paradigm of Integration Between the Segmentation Diffusion Model and Image Editing
by: Gao, Bo, et al.
Published: (2024)

CoFi-Dec: Hallucination-Resistant Decoding via Coarse-to-Fine Generative Feedback in Large Vision-Language Models
by: Cao, Zongsheng, et al.
Published: (2025)

SELMA: Learning and Merging Skill-Specific Text-to-Image Experts with Auto-Generated Data
by: Li, Jialu, et al.
Published: (2024)

Reasoning via Video: The First Evaluation of Video Models' Reasoning Abilities through Maze-Solving Tasks
by: Yang, Cheng, et al.
Published: (2025)

Reflective Human-Machine Co-adaptation for Enhanced Text-to-Image Generation Dialogue System
by: Feng, Yuheng, et al.
Published: (2024)

PhysicsMind: Sim and Real Mechanics Benchmarking for Physical Reasoning and Prediction in Foundational VLMs and World Models
by: Mak, Chak-Wing, et al.
Published: (2026)

SciVideoBench: Benchmarking Scientific Video Reasoning in Large Multimodal Models
by: Deng, Andong, et al.
Published: (2025)

GoViG: Goal-Conditioned Visual Navigation Instruction Generation via Multimodal Reasoning
by: Wu, Fengyi, et al.
Published: (2025)

Demystifying Video Reasoning
by: Wang, Ruisi, et al.
Published: (2026)