Staff View: UAL-Bench: The First Comprehensive Unusual Activity Localization Benchmark :: Library Catalog

Bibliographic Details
Main Authors:	Abdullah, Hasnat Md; Liu, Tian; Wei, Kangda; Kong, Shu; Huang, Ruihong
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition; Computation and Language
Online Access:	https://arxiv.org/abs/2410.01180

_version_	1866908276389576704
author	Abdullah, Hasnat Md; Liu, Tian; Wei, Kangda; Kong, Shu; Huang, Ruihong
author_facet	Abdullah, Hasnat Md; Liu, Tian; Wei, Kangda; Kong, Shu; Huang, Ruihong
contents	Localizing unusual activities, such as human errors or surveillance incidents, in videos holds practical significance. However, current video understanding models struggle to localize these unusual events, likely because such events are underrepresented in the models' pretraining datasets. To explore foundation models' capability in localizing unusual activity, we introduce UAL-Bench, a comprehensive benchmark for unusual activity localization, featuring three video datasets (UAG-OOPS, UAG-SSBD, and UAG-FunQA) and an instruction-tuning dataset, OOPS-UAG-Instruct, to improve model capabilities. UAL-Bench evaluates three approaches: Video-Language Models (Vid-LLMs), instruction-tuned Vid-LLMs, and a novel integration of Vision-Language Models and Large Language Models (VLM-LLM). Our results show the VLM-LLM approach excels in localizing short-span unusual events and predicting their onset (start time) more accurately than Vid-LLMs. We also propose a new metric, R@1, TD <= p, to address limitations in existing evaluation methods. Our findings highlight the challenges posed by long-duration videos, particularly in autism diagnosis scenarios, and the need for further advancements in localization techniques. Our work not only provides a benchmark for unusual activity localization but also outlines the key challenges for existing foundation models, suggesting future research directions on this important task.
format	Preprint
id	arxiv_https___arxiv_org_abs_2410_01180
institution	arXiv
publishDate	2024
record_format	arxiv
title	UAL-Bench: The First Comprehensive Unusual Activity Localization Benchmark
topic	Computer Vision and Pattern Recognition; Computation and Language
url	https://arxiv.org/abs/2410.01180

Similar Items