| Main Authors: | Zhang, Han; Jiang, Wanting; Kornuta, Tomasz; Zheng, Tian; Murali, Vidya |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2605.21917 |
Similar Items
MAVEN A Multi-Agent Framework for Multicultural Text-to-Video Generation
by: Li, Shuowei, et al.
Published: (2026)
MedVR: Annotation-Free Medical Visual Reasoning via Agentic Reinforcement Learning
by: Jiang, Zheng, et al.
Published: (2026)
MUPA: Towards Multi-Path Agentic Reasoning for Grounded Video Question Answering
by: Dang, Jisheng, et al.
Published: (2025)
VideoThinker: Building Agentic VideoLLMs with LLM-Guided Tool Reasoning
by: Li, Chenglin, et al.
Published: (2026)
MAVEN: Multi-modal Attention for Valence-Arousal Emotion Network
by: Ahire, Vrushank, et al.
Published: (2025)
MedSAM-Agent: Empowering Interactive Medical Image Segmentation with Multi-turn Agentic Reinforcement Learning
by: Liu, Shengyuan, et al.
Published: (2026)
Reasoning via Video: The First Evaluation of Video Models' Reasoning Abilities through Maze-Solving Tasks
by: Yang, Cheng, et al.
Published: (2025)
Watching, Reasoning, and Searching: A Video Deep Research Benchmark on Open Web for Agentic Video Reasoning
by: Liu, Chengwen, et al.
Published: (2026)
Beyond Perception: Evaluating Abstract Visual Reasoning through Multi-Stage Task
by: Jiang, Yanbei, et al.
Published: (2025)
Bridging Modalities, Spanning Time: Structured Memory for Ultra-Long Agentic Video Reasoning
by: Li, Jiazheng, et al.
Published: (2026)
VideoTemp-o3: Harmonizing Temporal Grounding and Video Understanding in Agentic Thinking-with-Videos
by: Liu, Wenqi, et al.
Published: (2026)
ReWatch-R1: Boosting Complex Video Reasoning in Large Vision-Language Models through Agentic Data Synthesis
by: Zhang, Congzhi, et al.
Published: (2025)
TimeScope: Towards Task-Oriented Temporal Grounding In Long Videos
by: Liu, Xiangrui, et al.
Published: (2025)
VISD: Enhancing Video Reasoning via Structured Self-Distillation
by: Lin, Hao, et al.
Published: (2026)
Process-of-Thought Reasoning for Videos
by: Zhang, Jusheng, et al.
Published: (2026)
VideoExplorer: Think With Videos For Agentic Long-Video Understanding
by: Yuan, Huaying, et al.
Published: (2025)
Tri-Reader: An Open-Access, Multi-Stage AI Pipeline for First-Pass Lung Nodule Annotation in Screening CT
by: Tushar, Fakrul Islam, et al.
Published: (2026)
Video Finetuning Improves Reasoning Between Frames
by: Yang, Ruiqi, et al.
Published: (2025)
AVA: Towards Agentic Video Analytics with Vision Language Models
by: Yan, Yuxuan, et al.
Published: (2025)
DeepSport: A Multimodal Large Language Model for Comprehensive Sports Video Reasoning via Agentic Reinforcement Learning
by: Zou, Junbo, et al.
Published: (2025)
M$^3$-Med: A Benchmark for Multi-lingual, Multi-modal, and Multi-hop Reasoning in Medical Instructional Video Understanding
by: Liu, Shenxi, et al.
Published: (2025)
TopoLogic: An Interpretable Pipeline for Lane Topology Reasoning on Driving Scenes
by: Fu, Yanping, et al.
Published: (2024)
Preacher: Paper-to-Video Agentic System
by: Liu, Jingwei, et al.
Published: (2025)
Agentic Video Intelligence: A Flexible Framework for Advanced Video Exploration and Understanding
by: Gao, Hong, et al.
Published: (2025)
EvalVerse: Pipeline-Aware and Expert-Calibrated Benchmarking for Professional Cinematic Video Generation
by: Yang, Songlin, et al.
Published: (2026)
Decoding the Pulse of Reasoning VLMs in Multi-Image Understanding Tasks
by: Li, Chenjun
Published: (2026)
End-to-End Agentic RAG System Training for Traceable Diagnostic Reasoning
by: Zheng, Qiaoyu, et al.
Published: (2025)
LumiVideo: An Intelligent Agentic System for Video Color Grading
by: Guo, Yuchen, et al.
Published: (2026)
DeVAn: Dense Video Annotation for Video-Language Models
by: Liu, Tingkai, et al.
Published: (2023)
FieldWorkArena: Agentic AI Benchmark for Real Field Work Tasks
by: Takahashi, Jun, et al.
Published: (2025)
UpstreamQA: A Modular Framework for Explicit Reasoning on Video Question Answering Tasks
by: Nguyen, Jason, et al.
Published: (2026)
VideoSEAL: Mitigating Evidence Misalignment in Agentic Long Video Understanding by Decoupling Answer Authority
by: Qiu, Chenhao, et al.
Published: (2026)
The 9th AI City Challenge
by: Tang, Zheng, et al.
Published: (2025)
Multi-Task Learning with Multi-Annotation Triplet Loss for Improved Object Detection
by: Zhou, Meilun, et al.
Published: (2025)
Agentic Pipeline for Self-Synchronized Multiview Joint Angle Monitoring in Uncalibrated Environments
by: Yu, Juncheng, et al.
Published: (2026)
Q-Ponder: A Unified Training Pipeline for Reasoning-based Visual Quality Assessment
by: Cai, Zhuoxuan, et al.
Published: (2025)
GRASPTrack: Geometry-Reasoned Association via Segmentation and Projection for Multi-Object Tracking
by: Han, Xudong, et al.
Published: (2025)
Xmodel-VLM: A Simple Baseline for Multimodal Vision Language Model
by: Xu, Wanting, et al.
Published: (2024)
OpenVLThinkerV2: A Generalist Multimodal Reasoning Model for Multi-domain Visual Tasks
by: Hu, Wenbo, et al.
Published: (2026)
Task Prototype-Based Knowledge Retrieval for Multi-Task Learning from Partially Annotated Data
by: Oh, Youngmin, et al.
Published: (2026)