Staff View: EEA: Exploration-Exploitation Agent for Long Video Understanding :: Library Catalog

Bibliographic Details
Main Authors:	Yang, Te; Zhu, Xiangyu; Wang, Bo; Chen, Quan; Jiang, Peng; Lei, Zhen
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2512.03500

_version_	1866908690576048128
author	Yang, Te; Zhu, Xiangyu; Wang, Bo; Chen, Quan; Jiang, Peng; Lei, Zhen
author_facet	Yang, Te; Zhu, Xiangyu; Wang, Bo; Chen, Quan; Jiang, Peng; Lei, Zhen
contents	Long-form video understanding requires efficient navigation of extensive visual data to pinpoint sparse yet critical information. Current approaches to long-form video understanding either suffer from severe computational overhead due to dense preprocessing, or fail to effectively balance exploration and exploitation, resulting in incomplete information coverage and inefficiency. In this work, we introduce EEA, a novel video agent framework that achieves exploration-exploitation balance through semantic guidance with a hierarchical tree search process. EEA autonomously discovers and dynamically updates task-relevant semantic queries, and collects video frames closely matched to these queries as semantic anchors. During the tree search process, instead of uniform expansion, EEA preferentially explores semantically relevant frames while ensuring sufficient coverage within unknown segments. Moreover, EEA adaptively combines intrinsic rewards from vision-language models (VLMs) with semantic priors by explicitly modeling uncertainty to achieve stable and precise evaluation of video segments. Experiments across various long-video benchmarks validate the superior performance and computational efficiency of our proposed method.
format	Preprint
id	arxiv_https___arxiv_org_abs_2512_03500
institution	arXiv
publishDate	2025
record_format	arxiv
title	EEA: Exploration-Exploitation Agent for Long Video Understanding
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2512.03500

Similar Items