Staff View: Library Catalog

Bibliographic Details
Main Authors:	Liang, Zhengyang; Shu, Yan; Liu, Xiangrui; Qin, Minghao; Liang, Kaixin; Sebe, Nicu; Liu, Zheng; Liao, Lizi
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2512.23044

_version_	1866909992433483776
author	Liang, Zhengyang; Shu, Yan; Liu, Xiangrui; Qin, Minghao; Liang, Kaixin; Sebe, Nicu; Liu, Zheng; Liao, Lizi
author_facet	Liang, Zhengyang; Shu, Yan; Liu, Xiangrui; Qin, Minghao; Liang, Kaixin; Sebe, Nicu; Liu, Zheng; Liao, Lizi
contents	The evolution of autonomous agents is redefining information seeking, transitioning from passive retrieval to proactive, open-ended web research. However, a significant modality gap remains in processing the web's most dynamic and information-dense modality: video. In this paper, we first formalize the task of Agentic Video Browsing and introduce Video-BrowseComp, a benchmark evaluating open-ended agentic browsing tasks that enforce a mandatory dependency on videos. We observe that current paradigms struggle to reconcile the scale of open-ended video exploration with the need for fine-grained visual verification. Direct visual inference (e.g., RAG) maximizes perception but incurs prohibitive context costs, while text-centric summarization optimizes efficiency but often misses critical visual details required for accurate grounding. To address this, we propose Video-Browser, a novel agent leveraging Pyramidal Perception, which filters with cheap metadata and zooms in with expensive visual perception only when necessary. Experiments demonstrate that our approach achieves a 37.5% relative improvement while reducing token consumption by 58.3% compared to direct visual inference, establishing a foundation for verifiable open-web video research. We open-source all code and the benchmark at https://anonymous.4open.science/r/VideoBrowser and https://github.com/chrisx599/Video-Browser.
format	Preprint
id	arxiv_https___arxiv_org_abs_2512_23044
institution	arXiv
publishDate	2025
record_format	arxiv
title	Video-Browser: Towards Agentic Open-web Video Browsing
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2512.23044
