| Main Authors: | He, Yangfan, Boo, Changgyu, Yoon, Jaehong |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Online Access: | https://arxiv.org/abs/2603.10652 |
Similar Items
CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion
by: Yu, Shoubin, et al.
Published: (2024)
Video-Skill-CoT: Skill-based Chain-of-Thoughts for Domain-Adaptive Video Reasoning
by: Lee, Daeun, et al.
Published: (2025)
RACCooN: A Versatile Instructional Video Editing Framework with Auto-Generated Narratives
by: Yoon, Jaehong, et al.
Published: (2024)
Video-RTS: Rethinking Reinforcement Learning and Test-Time Scaling for Efficient and Enhanced Video Reasoning
by: Wang, Ziyang, et al.
Published: (2025)
VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos
by: Wang, Ziyang, et al.
Published: (2024)
Self-Correcting Text-to-Video Generation with Misalignment Detection and Localized Refinement
by: Lee, Daeun, et al.
Published: (2024)
WorldMM: Dynamic Multimodal Memory Agent for Long Video Reasoning
by: Yeo, Woongyeong, et al.
Published: (2025)
EgoMemReason: A Memory-Driven Reasoning Benchmark for Long-Horizon Egocentric Video Understanding
by: Wang, Ziyang, et al.
Published: (2026)
MEXA: Towards General Multimodal Reasoning with Dynamic Multi-Expert Aggregation
by: Yu, Shoubin, et al.
Published: (2025)
EPiC: Efficient Video Camera Control Learning with Precise Anchor-Video Guidance
by: Wang, Zun, et al.
Published: (2025)
AnchorWeave: World-Consistent Video Generation with Retrieved Local Spatial Memories
by: Wang, Zun, et al.
Published: (2026)
DreamRunner: Fine-Grained Compositional Story-to-Video Generation with Retrieval-Augmented Motion Adaptation
by: Wang, Zun, et al.
Published: (2024)
Frame Guidance: Training-Free Guidance for Frame-Level Control in Video Diffusion Models
by: Jang, Sangwon, et al.
Published: (2025)
Continual Learning: Forget-free Winning Subnetworks for Video Representations
by: Kang, Haeyong, et al.
Published: (2023)
When and How Much to Imagine: Adaptive Test-Time Scaling with World Models for Visual Spatial Reasoning
by: Yu, Shoubin, et al.
Published: (2026)
SAFREE: Training-Free and Adaptive Guard for Safe Text-to-Image And Video Generation
by: Yoon, Jaehong, et al.
Published: (2024)
Training-free Guidance in Text-to-Video Generation via Multimodal Planning and Structured Noise Initialization
by: Li, Jialu, et al.
Published: (2025)
ECoFLaP: Efficient Coarse-to-Fine Layer-Wise Pruning for Vision-Language Models
by: Sung, Yi-Lin, et al.
Published: (2023)
Progressive Fourier Neural Representation for Sequential Video Compilation
by: Kang, Haeyong, et al.
Published: (2023)
SurGo-R1: Benchmarking and Modeling Contextual Reasoning for Operative Zone in Surgical Video
by: Qin, Guanyi, et al.
Published: (2026)
Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark
by: Guo, Ziyu, et al.
Published: (2025)
DART: Leveraging Multi-Agent Disagreement for Tool Recruitment in Multimodal Reasoning
by: Sivakumaran, Nithin, et al.
Published: (2025)
Planning with Sketch-Guided Verification for Physics-Aware Video Generation
by: Huang, Yidong, et al.
Published: (2025)
PhyMotion: Structured 3D Motion Reward for Physics-Grounded Human Video Generation
by: Huang, Yidong, et al.
Published: (2026)
Hierarchy-Aware Multimodal Unlearning for Medical AI
by: Wu, Fengli, et al.
Published: (2025)
DDPM-MoCo: Advancing Industrial Surface Defect Generation and Detection with Generative and Contrastive Learning
by: He, Yangfan, et al.
Published: (2024)
ReGraP-LLaVA: Reasoning enabled Graph-based Personalized Large Language and Vision Assistant
by: Xiang, Yifan, et al.
Published: (2025)
Outside Knowledge Conversational Video (OKCV) Dataset -- Dialoguing over Videos
by: Reichman, Benjamin, et al.
Published: (2025)
ViGoR-Bench: How Far Are Visual Generative Models From Zero-Shot Visual Reasoners?
by: Han, Haonan, et al.
Published: (2026)
Mementos: A Comprehensive Benchmark for Multimodal Large Language Model Reasoning over Image Sequences
by: Wang, Xiyao, et al.
Published: (2024)
Flatten: Video Action Recognition is an Image Classification task
by: Chen, Junlin, et al.
Published: (2024)
Free-Mask: A Novel Paradigm of Integration Between the Segmentation Diffusion Model and Image Editing
by: Gao, Bo, et al.
Published: (2024)
CoFi-Dec: Hallucination-Resistant Decoding via Coarse-to-Fine Generative Feedback in Large Vision-Language Models
by: Cao, Zongsheng, et al.
Published: (2025)
SELMA: Learning and Merging Skill-Specific Text-to-Image Experts with Auto-Generated Data
by: Li, Jialu, et al.
Published: (2024)
Reasoning via Video: The First Evaluation of Video Models' Reasoning Abilities through Maze-Solving Tasks
by: Yang, Cheng, et al.
Published: (2025)
Reflective Human-Machine Co-adaptation for Enhanced Text-to-Image Generation Dialogue System
by: Feng, Yuheng, et al.
Published: (2024)
PhysicsMind: Sim and Real Mechanics Benchmarking for Physical Reasoning and Prediction in Foundational VLMs and World Models
by: Mak, Chak-Wing, et al.
Published: (2026)
SciVideoBench: Benchmarking Scientific Video Reasoning in Large Multimodal Models
by: Deng, Andong, et al.
Published: (2025)
GoViG: Goal-Conditioned Visual Navigation Instruction Generation via Multimodal Reasoning
by: Wu, Fengyi, et al.
Published: (2025)
Demystifying Video Reasoning
by: Wang, Ruisi, et al.
Published: (2026)