:: Library Catalog

Bibliographic Details
Main Authors:	Du, Henghui, Zhang, Chunjie, Chen, Xi, Zhou, Chang, Hu, Di
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2512.17229

Similar Items

APPO: Attention-guided Perception Policy Optimization for Video Reasoning
by: Du, Henghui, et al.
Published: (2026)

Crab: A Unified Audio-Visual Scene Understanding Model with Explicit Cooperation
by: Du, Henghui, et al.
Published: (2025)

Video-o3: Native Interleaved Clue Seeking for Long Video Multi-Hop Reasoning
by: Zeng, Xiangyu, et al.
Published: (2026)

Track the Answer: Extending TextVQA from Image to Video with Spatio-Temporal Clues
by: Zhang, Yan, et al.
Published: (2024)

CG-Bench: Clue-grounded Question Answering Benchmark for Long Video Understanding
by: Chen, Guo, et al.
Published: (2024)

Boosting Audio Visual Question Answering via Key Semantic-Aware Cues
by: Li, Guangyao, et al.
Published: (2024)

An Empirical Study on How Video-LLMs Answer Video Questions
by: Gou, Chenhui, et al.
Published: (2025)

VideoDetective: Clue Hunting via both Extrinsic Query and Intrinsic Relevance for Long Video Understanding
by: Yang, Ruoliu, et al.
Published: (2026)

Grounded Question-Answering in Long Egocentric Videos
by: Di, Shangzhe, et al.
Published: (2023)

Align and Aggregate: Compositional Reasoning with Video Alignment and Answer Aggregation for Video Question-Answering
by: Liao, Zhaohe, et al.
Published: (2024)

MOVE: Motion-Guided Few-Shot Video Object Segmentation
by: Ying, Kaining, et al.
Published: (2025)

LLMs Meet Long Video: Advancing Long Video Question Answering with An Interactive Visual Adapter in LLMs
by: Li, Yunxin, et al.
Published: (2024)

SAM3-DMS: Decoupled Memory Selection for Multi-target Video Segmentation of SAM3
by: Shen, Ruiqi, et al.
Published: (2026)

A Benchmark and Multi-Agent System for Instruction-driven Cinematic Video Compilation
by: Zhang, Peixuan, et al.
Published: (2026)

Looking Beyond Visible Cues: Implicit Video Question Answering via Dual-Clue Reasoning
by: Chen, Tieyuan, et al.
Published: (2025)

Clue Matters: Leveraging Latent Visual Clues to Empower Video Reasoning
by: Zhang, Kaixin, et al.
Published: (2026)

Decoupling Static and Hierarchical Motion Perception for Referring Video Segmentation
by: He, Shuting, et al.
Published: (2024)

Admitting Ignorance Helps the Video Question Answering Models to Answer
by: Li, Haopeng, et al.
Published: (2025)

VideoSeek: Long-Horizon Video Agent with Tool-Guided Seeking
by: Lin, Jingyang, et al.
Published: (2026)

MovieChat+: Question-aware Sparse Memory for Long Video Question Answering
by: Song, Enxin, et al.
Published: (2024)

Mode Seeking meets Mean Seeking for Fast Long Video Generation
by: Cai, Shengqu, et al.
Published: (2026)

Crab$^{+}$: A Scalable and Unified Audio-Visual Scene Understanding Model with Explicit Cooperation
by: Cai, Dongnuan, et al.
Published: (2026)

AffectSeek: Agentic Affective Understanding in Long Videos under Vague User Queries
by: Zhang, Zhen, et al.
Published: (2026)

ClueTracer: Question-to-Vision Clue Tracing for Training-Free Hallucination Suppression in Multimodal Reasoning
by: Xi, Gongli, et al.
Published: (2026)

Active Video Perception: Iterative Evidence Seeking for Agentic Long Video Understanding
by: Wang, Ziyang, et al.
Published: (2025)

StreamReady: Learning What to Answer and When in Long Streaming Videos
by: Azad, Shehreen, et al.
Published: (2026)

Grounded Multi-Hop VideoQA in Long-Form Egocentric Videos
by: Chen, Qirui, et al.
Published: (2024)

VideoSEAL: Mitigating Evidence Misalignment in Agentic Long Video Understanding by Decoupling Answer Authority
by: Qiu, Chenhao, et al.
Published: (2026)

Can Video LLMs Refuse to Answer? Alignment for Answerability in Video Large Language Models
by: Yoon, Eunseop, et al.
Published: (2025)

VideoLLaMB: Long Streaming Video Understanding with Recurrent Memory Bridges
by: Wang, Yuxuan, et al.
Published: (2024)

VideoOdyssey: A Benchmark for Ultra-Long-Context and Omni-Modal Video Understanding
by: He, Haichen, et al.
Published: (2026)

LongCaptioning: Unlocking the Power of Long Video Caption Generation in Large Multimodal Models
by: Wei, Hongchen, et al.
Published: (2025)

Seek-and-Solve: Benchmarking MLLMs for Visual Clue-Driven Reasoning in Daily Scenarios
by: Li, Xiaomin, et al.
Published: (2026)

Streaming Video Question-Answering with In-context Video KV-Cache Retrieval
by: Di, Shangzhe, et al.
Published: (2025)

MeViS: A Multi-Modal Dataset for Referring Motion Expression Video Segmentation
by: Ding, Henghui, et al.
Published: (2025)

Narrative Aligned Long Form Video Question Answering
by: Jain, Rahul, et al.
Published: (2026)

EffectErase: Joint Video Object Removal and Insertion for High-Quality Effect Erasing
by: Fu, Yang, et al.
Published: (2026)

Video-as-Answer: Predict and Generate Next Video Event with Joint-GRPO
by: Cheng, Junhao, et al.
Published: (2025)

Long-Form Answers to Visual Questions from Blind and Low Vision People
by: Huh, Mina, et al.
Published: (2024)

A Simple LLM Framework for Long-Range Video Question-Answering
by: Zhang, Ce, et al.
Published: (2023)