Bibliographic Details
Main Authors: Zhou, Yiyang; Li, Linjie; Qiu, Shi; Yang, Zhengyuan; Zhao, Yuyang; Han, Siwei; He, Yangfan; Li, Kangqi; Ji, Haonian; Zhao, Zihao; Tong, Haibo; Wang, Lijuan; Yao, Huaxiu
Format: Preprint
Published: 2025
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2507.09491
_version_ 1866908448196657152
author Zhou, Yiyang
Li, Linjie
Qiu, Shi
Yang, Zhengyuan
Zhao, Yuyang
Han, Siwei
He, Yangfan
Li, Kangqi
Ji, Haonian
Zhao, Zihao
Tong, Haibo
Wang, Lijuan
Yao, Huaxiu
contents Existing video benchmarks often resemble image-based benchmarks, with question types like "What actions does the person perform throughout the video?" or "What color is the woman's dress in the video?" Models can often answer such questions by scanning just a few key frames, without deep temporal reasoning. This limits our ability to assess whether large vision-language models (LVLMs) can truly think with videos rather than perform superficial frame-level analysis. To address this, we introduce GLIMPSE, a benchmark specifically designed to evaluate whether LVLMs can genuinely think with videos. Unlike prior benchmarks, GLIMPSE emphasizes comprehensive video understanding beyond static image cues. It consists of 3,269 videos and over 4,342 highly visual-centric questions across 11 categories, including Trajectory Analysis, Temporal Reasoning, and Forensics Detection. All questions are carefully crafted by human annotators and require watching the entire video and reasoning over the full video context; this is what we mean by thinking with video. These questions cannot be answered by scanning selected frames or relying on text alone. Human evaluators achieve 94.82% accuracy on GLIMPSE, but current LVLMs face significant challenges. Even the best-performing model, GPT-o3, reaches only 66.43%, highlighting that LVLMs still struggle to move beyond surface-level reasoning to truly think with videos.
format Preprint
id arxiv_https___arxiv_org_abs_2507_09491
institution arXiv
publishDate 2025
record_format arxiv
title GLIMPSE: Do Large Vision-Language Models Truly Think With Videos or Just Glimpse at Them?
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2507.09491