| Main Authors: | Du, Henghui, Zhang, Chunjie, Chen, Xi, Zhou, Chang, Hu, Di |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2512.17229 |
Similar Items
APPO: Attention-guided Perception Policy Optimization for Video Reasoning
by: Du, Henghui, et al.
Published: (2026)
Crab: A Unified Audio-Visual Scene Understanding Model with Explicit Cooperation
by: Du, Henghui, et al.
Published: (2025)
Video-o3: Native Interleaved Clue Seeking for Long Video Multi-Hop Reasoning
by: Zeng, Xiangyu, et al.
Published: (2026)
Track the Answer: Extending TextVQA from Image to Video with Spatio-Temporal Clues
by: Zhang, Yan, et al.
Published: (2024)
CG-Bench: Clue-grounded Question Answering Benchmark for Long Video Understanding
by: Chen, Guo, et al.
Published: (2024)
Boosting Audio Visual Question Answering via Key Semantic-Aware Cues
by: Li, Guangyao, et al.
Published: (2024)
An Empirical Study on How Video-LLMs Answer Video Questions
by: Gou, Chenhui, et al.
Published: (2025)
VideoDetective: Clue Hunting via both Extrinsic Query and Intrinsic Relevance for Long Video Understanding
by: Yang, Ruoliu, et al.
Published: (2026)
Grounded Question-Answering in Long Egocentric Videos
by: Di, Shangzhe, et al.
Published: (2023)
Align and Aggregate: Compositional Reasoning with Video Alignment and Answer Aggregation for Video Question-Answering
by: Liao, Zhaohe, et al.
Published: (2024)
MOVE: Motion-Guided Few-Shot Video Object Segmentation
by: Ying, Kaining, et al.
Published: (2025)
LLMs Meet Long Video: Advancing Long Video Question Answering with An Interactive Visual Adapter in LLMs
by: Li, Yunxin, et al.
Published: (2024)
SAM3-DMS: Decoupled Memory Selection for Multi-target Video Segmentation of SAM3
by: Shen, Ruiqi, et al.
Published: (2026)
A Benchmark and Multi-Agent System for Instruction-driven Cinematic Video Compilation
by: Zhang, Peixuan, et al.
Published: (2026)
Looking Beyond Visible Cues: Implicit Video Question Answering via Dual-Clue Reasoning
by: Chen, Tieyuan, et al.
Published: (2025)
Clue Matters: Leveraging Latent Visual Clues to Empower Video Reasoning
by: Zhang, Kaixin, et al.
Published: (2026)
Decoupling Static and Hierarchical Motion Perception for Referring Video Segmentation
by: He, Shuting, et al.
Published: (2024)
Admitting Ignorance Helps the Video Question Answering Models to Answer
by: Li, Haopeng, et al.
Published: (2025)
VideoSeek: Long-Horizon Video Agent with Tool-Guided Seeking
by: Lin, Jingyang, et al.
Published: (2026)
MovieChat+: Question-aware Sparse Memory for Long Video Question Answering
by: Song, Enxin, et al.
Published: (2024)
Mode Seeking meets Mean Seeking for Fast Long Video Generation
by: Cai, Shengqu, et al.
Published: (2026)
Crab$^{+}$: A Scalable and Unified Audio-Visual Scene Understanding Model with Explicit Cooperation
by: Cai, Dongnuan, et al.
Published: (2026)
AffectSeek: Agentic Affective Understanding in Long Videos under Vague User Queries
by: Zhang, Zhen, et al.
Published: (2026)
ClueTracer: Question-to-Vision Clue Tracing for Training-Free Hallucination Suppression in Multimodal Reasoning
by: Xi, Gongli, et al.
Published: (2026)
Active Video Perception: Iterative Evidence Seeking for Agentic Long Video Understanding
by: Wang, Ziyang, et al.
Published: (2025)
StreamReady: Learning What to Answer and When in Long Streaming Videos
by: Azad, Shehreen, et al.
Published: (2026)
Grounded Multi-Hop VideoQA in Long-Form Egocentric Videos
by: Chen, Qirui, et al.
Published: (2024)
VideoSEAL: Mitigating Evidence Misalignment in Agentic Long Video Understanding by Decoupling Answer Authority
by: Qiu, Chenhao, et al.
Published: (2026)
Can Video LLMs Refuse to Answer? Alignment for Answerability in Video Large Language Models
by: Yoon, Eunseop, et al.
Published: (2025)
VideoLLaMB: Long Streaming Video Understanding with Recurrent Memory Bridges
by: Wang, Yuxuan, et al.
Published: (2024)
VideoOdyssey: A Benchmark for Ultra-Long-Context and Omni-Modal Video Understanding
by: He, Haichen, et al.
Published: (2026)
LongCaptioning: Unlocking the Power of Long Video Caption Generation in Large Multimodal Models
by: Wei, Hongchen, et al.
Published: (2025)
Seek-and-Solve: Benchmarking MLLMs for Visual Clue-Driven Reasoning in Daily Scenarios
by: Li, Xiaomin, et al.
Published: (2026)
Streaming Video Question-Answering with In-context Video KV-Cache Retrieval
by: Di, Shangzhe, et al.
Published: (2025)
MeViS: A Multi-Modal Dataset for Referring Motion Expression Video Segmentation
by: Ding, Henghui, et al.
Published: (2025)
Narrative Aligned Long Form Video Question Answering
by: Jain, Rahul, et al.
Published: (2026)
EffectErase: Joint Video Object Removal and Insertion for High-Quality Effect Erasing
by: Fu, Yang, et al.
Published: (2026)
Video-as-Answer: Predict and Generate Next Video Event with Joint-GRPO
by: Cheng, Junhao, et al.
Published: (2025)
Long-Form Answers to Visual Questions from Blind and Low Vision People
by: Huh, Mina, et al.
Published: (2024)
A Simple LLM Framework for Long-Range Video Question-Answering
by: Zhang, Ce, et al.
Published: (2023)