Staff View: Breaking Down Video LLM Benchmarks: Knowledge, Spatial Perception, or True Temporal Understanding? :: Library Catalog

Bibliographic Details
Main Authors:	Feng, Bo; Lai, Zhengfeng; Li, Shiyu; Wang, Zizhen; Wang, Simon; Huang, Ping; Cao, Meng
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2505.14321

_version_	1866915295113773056
author	Feng, Bo; Lai, Zhengfeng; Li, Shiyu; Wang, Zizhen; Wang, Simon; Huang, Ping; Cao, Meng
author_facet	Feng, Bo; Lai, Zhengfeng; Li, Shiyu; Wang, Zizhen; Wang, Simon; Huang, Ping; Cao, Meng
contents	Existing video understanding benchmarks often conflate knowledge-based and purely image-based questions, rather than clearly isolating a model's temporal reasoning ability, which is the key aspect that distinguishes video understanding from other modalities. We identify two major limitations that obscure whether higher scores truly indicate stronger understanding of the dynamic content in videos: (1) strong language priors, where models can answer questions without watching the video; and (2) shuffling invariance, where models maintain similar performance on certain questions even when video frames are temporally shuffled. To alleviate these issues, we propose VBenchComp, an automated pipeline that categorizes questions into three domains: LLM-Answerable, Semantic, and Temporal. Specifically, LLM-Answerable questions can be answered without viewing the video; Semantic questions remain answerable even when the video frames are shuffled; and Temporal questions require understanding the correct temporal order of frames. The remaining questions are labeled Others. This categorization enables fine-grained evaluation of the different capabilities of a video LLM. Our analysis reveals nuanced model weaknesses that are hidden by traditional overall scores, and we offer insights and recommendations for designing future benchmarks that more accurately assess video LLMs.
format	Preprint
id	arxiv_https___arxiv_org_abs_2505_14321
institution	arXiv
publishDate	2025
record_format	arxiv
title	Breaking Down Video LLM Benchmarks: Knowledge, Spatial Perception, or True Temporal Understanding?
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2505.14321
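
The categorization described in the contents field can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the abstract does not state the decision rules, so the three boolean inputs (whether a model answers correctly with no video, with shuffled frames, and with the original frames), the order of the checks, and all function and field names here are assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class QuestionResult:
    """Hypothetical per-question outcomes under three evaluation conditions."""
    question_id: str
    correct_without_video: bool  # assumed: model sees only the question text
    correct_with_shuffled: bool  # assumed: model sees temporally shuffled frames
    correct_with_original: bool  # assumed: model sees frames in original order


def categorize(result: QuestionResult) -> str:
    """Assign one of the four labels named in the abstract.

    The ordering of the checks is an assumption; the abstract only
    describes what each label means, not how ties are resolved.
    """
    if result.correct_without_video:
        return "LLM-Answerable"
    if result.correct_with_original and result.correct_with_shuffled:
        return "Semantic"
    if result.correct_with_original and not result.correct_with_shuffled:
        return "Temporal"
    return "Others"


def category_counts(results):
    """Count how many questions fall under each label."""
    return Counter(categorize(r) for r in results)


# Made-up example outcomes, one per label.
example_results = [
    QuestionResult("q1", True, True, True),
    QuestionResult("q2", False, True, True),
    QuestionResult("q3", False, False, True),
    QuestionResult("q4", False, False, False),
]
example_counts = category_counts(example_results)
```

Reporting accuracy separately for each label, instead of a single overall score, is the kind of fine-grained evaluation the abstract refers to.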

Similar Items