Bibliographic Details
Main Authors: Jin, Yujie, Zhang, Wenxin, Wang, Jingjing, Zhou, Guodong
Format: Preprint
Published: 2026
Subjects:
Online Access: https://arxiv.org/abs/2602.18019
_version_ 1866914339725770752
author Jin, Yujie
Zhang, Wenxin
Wang, Jingjing
Zhou, Guodong
author_facet Jin, Yujie
Zhang, Wenxin
Wang, Jingjing
Zhou, Guodong
contents In the literature, prior research on Security-oriented Video Understanding (SVU) has predominantly focused on detecting and localizing threats (e.g., shootings, robberies) in videos, while largely lacking the capability to generate and evaluate the causes of those threats. Motivated by these gaps, this paper introduces a new chat-paradigm SVU task, i.e., In-depth Security-oriented Video Understanding (DeepSVU), which aims not only to identify and locate threats but also to attribute and evaluate the causes of threatening segments. Furthermore, this paper reveals two key challenges in the proposed task: 1) how to effectively model the coarse-to-fine physical-world information (e.g., human behavior, object interactions and background context) to boost the DeepSVU task; and 2) how to adaptively trade off these factors. To tackle these challenges, this paper proposes a new Unified Physical-world Regularized MoE (UPRM) approach. Specifically, UPRM incorporates two key components, the Unified Physical-world Enhanced MoE (UPE) Block and the Physical-world Trade-off Regularizer (PTR), which address the above two challenges, respectively. Extensive experiments conducted on our DeepSVU instructions datasets (i.e., UCF-C instructions and CUVA instructions) demonstrate that UPRM outperforms several advanced Video-LLMs as well as non-VLM approaches. These results justify the importance of the coarse-to-fine physical-world information in the DeepSVU task and demonstrate the effectiveness of UPRM in capturing such information.
format Preprint
id arxiv_https___arxiv_org_abs_2602_18019
institution arXiv
publishDate 2026
record_format arxiv
title DeepSVU: Towards In-depth Security-oriented Video Understanding via Unified Physical-world Regularized MoE
topic Computer Vision and Pattern Recognition
Artificial Intelligence
url https://arxiv.org/abs/2602.18019