| Main Authors: | Gao, Hong, Bao, Yiming, Tu, Xuezhen, Xu, Yutong, Jin, Yue, Mu, Yiyang, Zhong, Bin, Yue, Linan, Zhang, Min-Ling |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2511.14446 |
Similar Items
APVR: Hour-Level Long Video Understanding with Adaptive Pivot Visual Information Retrieval
by: Gao, Hong, et al.
Published: (2025)
Training Multimodal Large Reasoning Models Needs Better Thoughts: A Three-Stage Framework for Long Chain-of-Thought Synthesis and Selection
by: Wang, Yizhi, et al.
Published: (2025)
An Efficient Streaming Video Understanding Framework with Agentic Control
by: Liu, Jinming, et al.
Published: (2026)
Bridging Efficiency and Transparency: Explainable CoT Compression in Multimodal Large Reasoning Models
by: Wang, Yizhi, et al.
Published: (2026)
A Skill-augmented Agentic Framework and Benchmark for Multi-Video Understanding
by: Zhang, Yue, et al.
Published: (2026)
Manifold-Aware Exploration for Reinforcement Learning in Video Generation
by: Zheng, Mingzhe, et al.
Published: (2026)
SmartSight: Mitigating Hallucination in Video-LLMs Without Compromising Video Understanding via Temporal Attention Collapse
by: Sun, Yiming, et al.
Published: (2025)
Agentic Very Long Video Understanding
by: Rege, Aniket, et al.
Published: (2026)
VideoTemp-o3: Harmonizing Temporal Grounding and Video Understanding in Agentic Thinking-with-Videos
by: Liu, Wenqi, et al.
Published: (2026)
VideoExplorer: Think With Videos For Agentic Long-Video Understanding
by: Yuan, Huaying, et al.
Published: (2025)
LensWalk: Agentic Video Understanding by Planning How You See in Videos
by: Li, Keliang, et al.
Published: (2026)
Where, Not What: Compelling Video LLMs to Learn Geometric Causality for 3D-Grounding
by: Zhong, Yutong
Published: (2025)
Deep Video Discovery: Agentic Search with Tool Use for Long-form Video Understanding
by: Zhang, Xiaoyi, et al.
Published: (2025)
Bridging Time and Space: Decoupled Spatio-Temporal Alignment for Video Grounding
by: Tu, Xuezhen, et al.
Published: (2026)
VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding
by: Fan, Yue, et al.
Published: (2024)
OwlSight: A Robust Illumination Adaptation Framework for Dark Video Human Action Recognition
by: Cheng, Shihao, et al.
Published: (2025)
ContextFlow: Training-Free Video Object Editing via Adaptive Context Enrichment
by: Chen, Yiyang, et al.
Published: (2025)
VideoSeeker: Incentivizing Instance-level Video Understanding via Native Agentic Tool Invocation
by: Zhao, Yiming, et al.
Published: (2026)
LumiVideo: An Intelligent Agentic System for Video Color Grading
by: Guo, Yuchen, et al.
Published: (2026)
Code2MCP: Transforming Code Repositories into MCP Services
by: Ouyang, Chaoqian, et al.
Published: (2025)
Video2LoRA: Unified Semantic-Controlled Video Generation via Per-Reference-Video LoRA
by: Wu, Zexi, et al.
Published: (2026)
A Unified Framework for Human-centric Point Cloud Video Understanding
by: Xu, Yiteng, et al.
Published: (2024)
MultiMotion: Multi Subject Video Motion Transfer via Video Diffusion Transformer
by: Liu, Penghui, et al.
Published: (2025)
Guided by Trajectories: Repairing and Rewarding Tool-Use Trajectories for Tool-Integrated Reasoning
by: Gong, Siyu, et al.
Published: (2026)
The Dynamic Prior: Understanding 3D Structures for Casual Dynamic Videos
by: Wu, Zhuoyuan, et al.
Published: (2025)
VideoCLIP-XL: Advancing Long Description Understanding for Video CLIP Models
by: Wang, Jiapeng, et al.
Published: (2024)
Lumina-Video: Efficient and Flexible Video Generation with Multi-scale Next-DiT
by: Liu, Dongyang, et al.
Published: (2025)
Hybrid 3D Human Pose Estimation with Monocular Video and Sparse IMUs
by: Bao, Yiming, et al.
Published: (2024)
VideoNSA: Native Sparse Attention Scales Video Understanding
by: Song, Enxin, et al.
Published: (2025)
Apollo: An Exploration of Video Understanding in Large Multimodal Models
by: Zohar, Orr, et al.
Published: (2024)
Collaborative Learning of On-Device Small Model and Cloud-Based Large Model: Advances and Future Directions
by: Niu, Chaoyue, et al.
Published: (2025)
Active Video Perception: Iterative Evidence Seeking for Agentic Long Video Understanding
by: Wang, Ziyang, et al.
Published: (2025)
VideoCoF: Unified Video Editing with Temporal Reasoner
by: Yang, Xiangpeng, et al.
Published: (2025)
LVC: A Lightweight Compression Framework for Enhancing VLMs in Long Video Understanding
by: Wang, Ziyi, et al.
Published: (2025)
Preacher: Paper-to-Video Agentic System
by: Liu, Jingwei, et al.
Published: (2025)
DexImit: Learning Bimanual Dexterous Manipulation from Monocular Human Videos
by: Mu, Juncheng, et al.
Published: (2026)
TinyLLaVA-Video: Towards Smaller LMMs for Video Understanding with Group Resampler
by: Zhang, Xingjian, et al.
Published: (2025)
ReAgent-V: A Reward-Driven Multi-Agent Framework for Video Understanding
by: Zhou, Yiyang, et al.
Published: (2025)
Agentic Explainable Artificial Intelligence (Agentic XAI) Approach To Explore Better Explanation
by: Yamaguchi, Tomoaki, et al.
Published: (2025)
VideoARM: Agentic Reasoning over Hierarchical Memory for Long-Form Video Understanding
by: Yin, Yufei, et al.
Published: (2025)