| Main Authors: | Yin, Yufei; Xing, Yuchen; Meng, Qianke; Chen, Minghao; Yang, Yan; Yu, Zhou |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | Computer Vision and Pattern Recognition |
| Online Access: | https://arxiv.org/abs/2604.02891 |
| _version_ | 1866910100079247360 |
|---|---|
| author | Yin, Yufei; Xing, Yuchen; Meng, Qianke; Chen, Minghao; Yang, Yan; Yu, Zhou |
| author_facet | Yin, Yufei; Xing, Yuchen; Meng, Qianke; Chen, Minghao; Yang, Yan; Yu, Zhou |
| contents | Understanding long videos requires extracting query-relevant information from long sequences under tight compute budgets. Existing text-then-LLM pipelines lose fine-grained visual cues, while video-based multimodal large language models (MLLMs) can keep visual details but are too frame-hungry and computationally expensive. In this work, we aim to harness MLLMs for efficient video understanding. We propose ProVCA, a progressive video condensation agent that iteratively locates key video frames at multiple granularities. ProVCA first adopts a segment localization module to identify the video segment relevant to the query, then a snippet selection module to select important snippets based on similarity, and finally a keyframe refinement module to pinpoint specific keyframes in those snippets. By progressively narrowing the scope from coarse segments to fine frames, ProVCA identifies a small set of keyframes for MLLM-based reasoning. ProVCA achieves state-of-the-art zero-shot accuracies of 69.3% on EgoSchema, 80.5% on NExT-QA, and 77.7% on IntentQA, while using fewer frames than previous training-free methods. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2604_02891 |
| institution | arXiv |
| publishDate | 2026 |
| record_format | arxiv |
| title | Progressive Video Condensation with MLLM Agent for Long-form Video Understanding |
| topic | Computer Vision and Pattern Recognition |
| url | https://arxiv.org/abs/2604.02891 |
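
The coarse-to-fine narrowing described in the abstract (segment, then snippets, then keyframes) can be illustrated with a small toy sketch. This is not the authors' implementation: the record gives no algorithmic details, so every name and parameter here (`condense`, `segment_len`, `snippet_len`, `top_snippets`, `top_frames`) is hypothetical, and plain cosine similarity between toy frame vectors and a query vector stands in for all three stages, including the ones that ProVCA delegates to an MLLM.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def chunk(indices, size):
    """Split a list of frame indices into consecutive groups of at most `size`."""
    return [indices[i:i + size] for i in range(0, len(indices), size)]


def mean_score(frames, idxs, query):
    """Average query similarity over a group of frames."""
    return sum(cosine(frames[i], query) for i in idxs) / len(idxs)


def condense(frames, query, segment_len=8, snippet_len=2,
             top_snippets=2, top_frames=2):
    """Toy coarse-to-fine keyframe selection (all parameters are assumptions)."""
    all_idx = list(range(len(frames)))

    # Stage 1 (stand-in for segment localization): keep the best-scoring segment.
    segments = chunk(all_idx, segment_len)
    best_segment = max(segments, key=lambda s: mean_score(frames, s, query))

    # Stage 2 (stand-in for snippet selection): rank snippets by similarity.
    snippets = chunk(best_segment, snippet_len)
    ranked = sorted(snippets, key=lambda s: mean_score(frames, s, query),
                    reverse=True)[:top_snippets]

    # Stage 3 (stand-in for keyframe refinement): keep the top frames
    # from the surviving snippets, returned in temporal order.
    candidates = [i for s in ranked for i in s]
    keyframes = sorted(candidates, key=lambda i: cosine(frames[i], query),
                       reverse=True)[:top_frames]
    return sorted(keyframes)
```

Each stage only scores what the previous stage kept, so the number of frames examined at the finest level shrinks with every step; that is the general idea the abstract attributes to progressive condensation, shown here under the simplifying assumptions above.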