Staff View: E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding :: Library Catalog


Bibliographic Details
Main Authors:	Liu, Ye; Ma, Zongyang; Qi, Zhongang; Wu, Yang; Shan, Ying; Chen, Chang Wen
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2409.18111

_version_	1866909327309144064
author	Liu, Ye; Ma, Zongyang; Qi, Zhongang; Wu, Yang; Shan, Ying; Chen, Chang Wen
author_facet	Liu, Ye; Ma, Zongyang; Qi, Zhongang; Wu, Yang; Shan, Ying; Chen, Chang Wen
contents	Recent advances in Video Large Language Models (Video-LLMs) have demonstrated their great potential in general-purpose video understanding. To verify the significance of these models, a number of benchmarks have been proposed to diagnose their capabilities in different scenarios. However, existing benchmarks merely evaluate models through video-level question-answering, lacking fine-grained event-level assessment and task diversity. To fill this gap, we introduce E.T. Bench (Event-Level & Time-Sensitive Video Understanding Benchmark), a large-scale and high-quality benchmark for open-ended event-level video understanding. Categorized within a 3-level task taxonomy, E.T. Bench encompasses 7.3K samples under 12 tasks with 7K videos (251.4h total length) under 8 domains, providing comprehensive evaluations. We extensively evaluated 8 Image-LLMs and 12 Video-LLMs on our benchmark, and the results reveal that state-of-the-art models for coarse-level (video-level) understanding struggle to solve our fine-grained tasks, e.g., grounding events of interest within videos, largely due to the short video context length, improper time representations, and lack of multi-event training data. Focusing on these issues, we further propose a strong baseline model, E.T. Chat, together with an instruction-tuning dataset, E.T. Instruct 164K, tailored for fine-grained event-level understanding. Our simple but effective solution demonstrates superior performance in multiple scenarios.
format	Preprint
id	arxiv_https___arxiv_org_abs_2409_18111
institution	arXiv
publishDate	2024
record_format	arxiv
title	E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2409.18111
