Bibliographic Details
Main Authors: Liu, Zihao, Wu, Xiaoyu, Li, Wenna, Wu, Jianqin, Yang, Linlin
Format: Preprint
Published: 2026
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2604.07772
author Liu, Zihao
Wu, Xiaoyu
Li, Wenna
Wu, Jianqin
Yang, Linlin
contents Open-world video anomaly detection (OWVAD) aims to detect and explain abnormal events under different anomaly definitions, which is important for applications such as intelligent surveillance and live-streaming content moderation. Recent MLLM-based methods have shown promising open-world generalization, but still suffer from three major limitations: inefficiency in practical deployment, lack of adaptation to streaming processing, and limited support for dynamic anomaly definitions in both modeling and evaluation. To address these issues, this paper proposes ESOM, an efficient streaming OWVAD model that operates in a training-free manner. ESOM includes a Definition Normalization module that structures user prompts to reduce hallucination, an Inter-frame-matched Intra-frame Token Merging module that compresses redundant visual tokens, a Hybrid Streaming Memory module for efficient causal inference, and a Probabilistic Scoring module that converts interval-level textual outputs into frame-level anomaly scores. In addition, this paper introduces OpenDef-Bench, a new benchmark with clean surveillance videos and diverse natural anomaly definitions for evaluating performance under varying conditions. Extensive experiments show that ESOM achieves real-time efficiency on a single GPU and state-of-the-art performance in anomaly temporal localization, classification, and description generation. The code and benchmark will be released at https://github.com/Kamino666/ESOM_OpenDef-Bench.
format Preprint
id arxiv_https___arxiv_org_abs_2604_07772
institution arXiv
publishDate 2026
record_format arxiv
title ESOM: Efficiently Understanding Streaming Video Anomalies with Open-world Dynamic Definitions
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2604.07772