:: Library Catalog

Bibliographic Details
Main Authors:	Gao, Hong; Bao, Yiming; Tu, Xuezhen; Xu, Yutong; Jin, Yue; Mu, Yiyang; Zhong, Bin; Yue, Linan; Zhang, Min-Ling
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2511.14446

Similar Items

APVR: Hour-Level Long Video Understanding with Adaptive Pivot Visual Information Retrieval
by: Gao, Hong, et al.
Published: (2025)

Training Multimodal Large Reasoning Models Needs Better Thoughts: A Three-Stage Framework for Long Chain-of-Thought Synthesis and Selection
by: Wang, Yizhi, et al.
Published: (2025)

An Efficient Streaming Video Understanding Framework with Agentic Control
by: Liu, Jinming, et al.
Published: (2026)

Bridging Efficiency and Transparency: Explainable CoT Compression in Multimodal Large Reasoning Models
by: Wang, Yizhi, et al.
Published: (2026)

A Skill-augmented Agentic Framework and Benchmark for Multi-Video Understanding
by: Zhang, Yue, et al.
Published: (2026)

Manifold-Aware Exploration for Reinforcement Learning in Video Generation
by: Zheng, Mingzhe, et al.
Published: (2026)

SmartSight: Mitigating Hallucination in Video-LLMs Without Compromising Video Understanding via Temporal Attention Collapse
by: Sun, Yiming, et al.
Published: (2025)

Agentic Very Long Video Understanding
by: Rege, Aniket, et al.
Published: (2026)

VideoTemp-o3: Harmonizing Temporal Grounding and Video Understanding in Agentic Thinking-with-Videos
by: Liu, Wenqi, et al.
Published: (2026)

VideoExplorer: Think With Videos For Agentic Long-Video Understanding
by: Yuan, Huaying, et al.
Published: (2025)

LensWalk: Agentic Video Understanding by Planning How You See in Videos
by: Li, Keliang, et al.
Published: (2026)

Where, Not What: Compelling Video LLMs to Learn Geometric Causality for 3D-Grounding
by: Zhong, Yutong
Published: (2025)

Deep Video Discovery: Agentic Search with Tool Use for Long-form Video Understanding
by: Zhang, Xiaoyi, et al.
Published: (2025)

Bridging Time and Space: Decoupled Spatio-Temporal Alignment for Video Grounding
by: Tu, Xuezhen, et al.
Published: (2026)

VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding
by: Fan, Yue, et al.
Published: (2024)

OwlSight: A Robust Illumination Adaptation Framework for Dark Video Human Action Recognition
by: Cheng, Shihao, et al.
Published: (2025)

ContextFlow: Training-Free Video Object Editing via Adaptive Context Enrichment
by: Chen, Yiyang, et al.
Published: (2025)

VideoSeeker: Incentivizing Instance-level Video Understanding via Native Agentic Tool Invocation
by: Zhao, Yiming, et al.
Published: (2026)

LumiVideo: An Intelligent Agentic System for Video Color Grading
by: Guo, Yuchen, et al.
Published: (2026)

Code2MCP: Transforming Code Repositories into MCP Services
by: Ouyang, Chaoqian, et al.
Published: (2025)

Video2LoRA: Unified Semantic-Controlled Video Generation via Per-Reference-Video LoRA
by: Wu, Zexi, et al.
Published: (2026)

A Unified Framework for Human-centric Point Cloud Video Understanding
by: Xu, Yiteng, et al.
Published: (2024)

MultiMotion: Multi Subject Video Motion Transfer via Video Diffusion Transformer
by: Liu, Penghui, et al.
Published: (2025)

Guided by Trajectories: Repairing and Rewarding Tool-Use Trajectories for Tool-Integrated Reasoning
by: Gong, Siyu, et al.
Published: (2026)

The Dynamic Prior: Understanding 3D Structures for Casual Dynamic Videos
by: Wu, Zhuoyuan, et al.
Published: (2025)

VideoCLIP-XL: Advancing Long Description Understanding for Video CLIP Models
by: Wang, Jiapeng, et al.
Published: (2024)

Lumina-Video: Efficient and Flexible Video Generation with Multi-scale Next-DiT
by: Liu, Dongyang, et al.
Published: (2025)

Hybrid 3D Human Pose Estimation with Monocular Video and Sparse IMUs
by: Bao, Yiming, et al.
Published: (2024)

VideoNSA: Native Sparse Attention Scales Video Understanding
by: Song, Enxin, et al.
Published: (2025)

Apollo: An Exploration of Video Understanding in Large Multimodal Models
by: Zohar, Orr, et al.
Published: (2024)

Collaborative Learning of On-Device Small Model and Cloud-Based Large Model: Advances and Future Directions
by: Niu, Chaoyue, et al.
Published: (2025)

Active Video Perception: Iterative Evidence Seeking for Agentic Long Video Understanding
by: Wang, Ziyang, et al.
Published: (2025)

VideoCoF: Unified Video Editing with Temporal Reasoner
by: Yang, Xiangpeng, et al.
Published: (2025)

LVC: A Lightweight Compression Framework for Enhancing VLMs in Long Video Understanding
by: Wang, Ziyi, et al.
Published: (2025)

Preacher: Paper-to-Video Agentic System
by: Liu, Jingwei, et al.
Published: (2025)

DexImit: Learning Bimanual Dexterous Manipulation from Monocular Human Videos
by: Mu, Juncheng, et al.
Published: (2026)

TinyLLaVA-Video: Towards Smaller LMMs for Video Understanding with Group Resampler
by: Zhang, Xingjian, et al.
Published: (2025)

ReAgent-V: A Reward-Driven Multi-Agent Framework for Video Understanding
by: Zhou, Yiyang, et al.
Published: (2025)

Agentic Explainable Artificial Intelligence (Agentic XAI) Approach To Explore Better Explanation
by: Yamaguchi, Tomoaki, et al.
Published: (2025)

VideoARM: Agentic Reasoning over Hierarchical Memory for Long-Form Video Understanding
by: Yin, Yufei, et al.
Published: (2025)