| Main Authors: | Jiang, Yifan; Wang, Yueying; Zhao, Rui; Parag, Toufiq; Chen, Zhimin; Liao, Zhenyu; Unnikrishnan, Jayakrishnan |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2511.11113 |
Similar Items
VeRVE: Versatile Retrieval for Videos via Unified Embeddings
by: Halbe, Shaunak, et al.
Published: (2026)
Modality Agnostic Efficient Long Range Encoder
by: Parag, Toufiq, et al.
Published: (2025)
Perception, Understanding and Reasoning, A Multimodal Benchmark for Video Fake News Detection
by: Yakun, Cui, et al.
Published: (2025)
Reasoning-Guided Grounding: Elevating Video Anomaly Detection through Multimodal Large Language Models
by: Agarwal, Sakshi, et al.
Published: (2026)
Improved Visual-Spatial Reasoning via R1-Zero-Like Training
by: Liao, Zhenyi, et al.
Published: (2025)
Semantic-Geometric Dual Compression: Training-Free Visual Token Reduction for Ultra-High-Resolution Remote Sensing Understanding
by: Li, Yueying, et al.
Published: (2026)
Benchmarking Scientific Understanding and Reasoning for Video Generation using VideoScience-Bench
by: Hu, Lanxiang, et al.
Published: (2025)
AD-MIR: Bridging the Gap from Perception to Persuasion in Advertising Video Understanding via Structured Reasoning
by: Xu, Binxiao, et al.
Published: (2026)
Think with Grounding: Curriculum Reinforced Reasoning with Video Grounding for Long Video Understanding
by: Chen, Houlun, et al.
Published: (2026)
Personalized Video Summarization by Multimodal Video Understanding
by: Chen, Brian, et al.
Published: (2024)
Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition
by: Fei, Hao, et al.
Published: (2024)
Abductive Ego-View Accident Video Understanding for Safe Driving Perception
by: Fang, Jianwu, et al.
Published: (2024)
DAVID-XR1: Detecting AI-Generated Videos with Explainable Reasoning
by: Gao, Yifeng, et al.
Published: (2025)
Motion-Grounded Video Reasoning: Understanding and Perceiving Motion at Pixel Level
by: Deng, Andong, et al.
Published: (2024)
Robust-R1: Degradation-Aware Reasoning for Robust Visual Understanding
by: Tang, Jiaqi, et al.
Published: (2025)
A Large-Scale Multimodal Dataset and Benchmarks for Human Activity Scene Understanding and Reasoning
by: Jiang, Siyang, et al.
Published: (2025)
CrashSight: A Phase-Aware, Infrastructure-Centric Video Benchmark for Traffic Crash Scene Understanding and Reasoning
by: Gan, Rui, et al.
Published: (2026)
SIV-Bench: A Video Benchmark for Social Interaction Understanding and Reasoning
by: Kong, Fanqi, et al.
Published: (2025)
Universal Visuo-Tactile Video Understanding for Embodied Interaction
by: Xie, Yifan, et al.
Published: (2025)
Audio-centric Video Understanding Benchmark without Text Shortcut
by: Yang, Yudong, et al.
Published: (2025)
EgoEsportsQA: An Egocentric Video Benchmark for Perception and Reasoning in Esports
by: Ma, Jianzhe, et al.
Published: (2026)
R3G: A Reasoning–Retrieval–Reranking Framework for Vision-Centric Answer Generation
by: Chen, Zhuohong, et al.
Published: (2026)
Benchmarking Multi-Image Understanding in Vision and Language Models: Perception, Knowledge, Reasoning, and Multi-Hop Reasoning
by: Zhao, Bingchen, et al.
Published: (2024)
Automated Segmentation of Ischemic Stroke Lesions in Non-Contrast Computed Tomography Images for Enhanced Treatment and Prognosis
by: Musah, Toufiq, et al.
Published: (2024)
VideoVista: A Versatile Benchmark for Video Understanding and Reasoning
by: Li, Yunxin, et al.
Published: (2024)
Em-Garde: A Propose-Match Framework for Proactive Streaming Video Understanding
by: Zheng, Yikai, et al.
Published: (2026)
Place-it-R1: Unlocking Environment-aware Reasoning Potential of MLLM for Video Object Insertion
by: Gu, Bohai, et al.
Published: (2026)
Enhancing Long Video Understanding via Hierarchical Event-Based Memory
by: Cheng, Dingxin, et al.
Published: (2024)
QuickVideo: Real-Time Long Video Understanding with System Algorithm Co-Design
by: Schneider, Benjamin, et al.
Published: (2025)
GeoEyes: On-Demand Visual Focusing for Evidence-Grounded Understanding of Ultra-High-Resolution Remote Sensing Imagery
by: Wang, Fengxiang, et al.
Published: (2026)
VisionCoach: Reinforcing Grounded Video Reasoning via Visual-Perception Prompting
by: Lee, Daeun, et al.
Published: (2026)
H2VU-Benchmark: A Comprehensive Benchmark for Hierarchical Holistic Video Understanding
by: Wu, Qi, et al.
Published: (2025)
VideoPrism: A Foundational Visual Encoder for Video Understanding
by: Zhao, Long, et al.
Published: (2024)
CoF-T2I: Video Models as Pure Visual Reasoners for Text-to-Image Generation
by: Tong, Chengzhuo, et al.
Published: (2026)
Perception-R1: Advancing Multimodal Reasoning Capabilities of MLLMs via Visual Perception Reward
by: Xiao, Tong, et al.
Published: (2025)
Vamos: Versatile Action Models for Video Understanding
by: Wang, Shijie, et al.
Published: (2023)
RAVU: Retrieval Augmented Video Understanding with Compositional Reasoning over Graph
by: Malik, Sameer, et al.
Published: (2025)
Fact-R1: Towards Explainable Video Misinformation Detection with Deep Reasoning
by: Zhang, Fanrui, et al.
Published: (2025)
TimeSearch-R: Adaptive Temporal Search for Long-Form Video Understanding via Self-Verification Reinforcement Learning
by: Pan, Junwen, et al.
Published: (2025)
R^3-VQA: "Read the Room" by Video Social Reasoning
by: Niu, Lixing, et al.
Published: (2025)