Bibliographic Details
Main Authors: Liang, Zhengyang; Shu, Yan; Liu, Xiangrui; Qin, Minghao; Liang, Kaixin; Sebe, Nicu; Liu, Zheng; Liao, Lizi
Format: Preprint
Published: 2025
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2512.23044
author Liang, Zhengyang
Shu, Yan
Liu, Xiangrui
Qin, Minghao
Liang, Kaixin
Sebe, Nicu
Liu, Zheng
Liao, Lizi
contents The evolution of autonomous agents is redefining information seeking, transitioning from passive retrieval to proactive, open-ended web research. However, a significant modality gap remains in processing the web's most dynamic and information-dense modality: video. In this paper, we first formalize the task of Agentic Video Browsing and introduce Video-BrowseComp, a benchmark evaluating open-ended agentic browsing tasks that enforce a mandatory dependency on videos. We observe that current paradigms struggle to reconcile the scale of open-ended video exploration with the need for fine-grained visual verification. Direct visual inference (e.g., RAG) maximizes perception but incurs prohibitive context costs, while text-centric summarization optimizes efficiency but often misses critical visual details required for accurate grounding. To address this, we propose Video-Browser, a novel agent leveraging Pyramidal Perception, which filters with cheap metadata and zooms in with expensive visual perception only when necessary. Experiments demonstrate that our approach achieves a 37.5% relative improvement while reducing token consumption by 58.3% compared to direct visual inference, establishing a foundation for verifiable open-web video research. We open-source all code and the benchmark at https://anonymous.4open.science/r/VideoBrowser and https://github.com/chrisx599/Video-Browser.
format Preprint
id arxiv_https___arxiv_org_abs_2512_23044
institution arXiv
publishDate 2025
record_format arxiv
title Video-Browser: Towards Agentic Open-web Video Browsing
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2512.23044