Staff View: ESOM: Efficiently Understanding Streaming Video Anomalies with Open-world Dynamic Definitions :: Library Catalog

Bibliographic Details
Main Authors:	Liu, Zihao; Wu, Xiaoyu; Li, Wenna; Wu, Jianqin; Yang, Linlin
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2604.07772

_version_	1866908949177958400
author	Liu, Zihao; Wu, Xiaoyu; Li, Wenna; Wu, Jianqin; Yang, Linlin
author_facet	Liu, Zihao; Wu, Xiaoyu; Li, Wenna; Wu, Jianqin; Yang, Linlin
contents	Open-world video anomaly detection (OWVAD) aims to detect and explain abnormal events under different anomaly definitions, which is important for applications such as intelligent surveillance and live-streaming content moderation. Recent MLLM-based methods have shown promising open-world generalization, but still suffer from three major limitations: inefficiency for practical deployment, lack of streaming processing adaptation, and limited support for dynamic anomaly definitions in both modeling and evaluation. To address these issues, this paper proposes ESOM, an efficient streaming OWVAD model that operates in a training-free manner. ESOM includes a Definition Normalization module to structure user prompts for reducing hallucination, an Inter-frame-matched Intra-frame Token Merging module to compress redundant visual tokens, a Hybrid Streaming Memory module for efficient causal inference, and a Probabilistic Scoring module that converts interval-level textual outputs into frame-level anomaly scores. In addition, this paper introduces OpenDef-Bench, a new benchmark with clean surveillance videos and diverse natural anomaly definitions for evaluating performance under varying conditions. Extensive experiments show that ESOM achieves real-time efficiency on a single GPU and state-of-the-art performance in anomaly temporal localization, classification, and description generation. The code and benchmark will be released at https://github.com/Kamino666/ESOM_OpenDef-Bench.
format	Preprint
id	arxiv_https___arxiv_org_abs_2604_07772
institution	arXiv
publishDate	2026
record_format	arxiv
title	ESOM: Efficiently Understanding Streaming Video Anomalies with Open-world Dynamic Definitions
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2604.07772

Similar Items