Bibliographic Details
Main Authors: Bekit, Lokman, Karim, Hamza, Nguyen, Nghia T, Yilmaz, Yasin
Format: Preprint
Published: 2026
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2604.03040
_version_ 1866913001688268800
author Bekit, Lokman
Karim, Hamza
Nguyen, Nghia T
Yilmaz, Yasin
author_facet Bekit, Lokman
Karim, Hamza
Nguyen, Nghia T
Yilmaz, Yasin
contents Video Anomaly Detection (VAD) is a fundamental challenge in computer vision, particularly due to the open-set nature of anomalies. While recent training-free approaches utilizing Vision-Language Models (VLMs) have shown promise, they typically rely on massive, resource-intensive foundation models to compensate for the ambiguity of static prompts. We argue that the bottleneck in VAD is not necessarily model capacity, but rather the static nature of inquiry. We propose QVAD, a question-centric agentic framework that treats VLM-LLM interaction as a dynamic dialogue. By iteratively refining queries based on visual context, our LLM agent guides smaller VLMs to produce high-fidelity captions and precise semantic reasoning without parameter updates. This "prompt-updating" mechanism effectively unlocks the latent capabilities of lightweight models, enabling state-of-the-art performance on UCF-Crime, XD-Violence, and UBNormal using a fraction of the parameters required by competing methods. We further demonstrate exceptional generalizability on the single-scene ComplexVAD dataset. Crucially, QVAD achieves high inference speeds with minimal memory footprints, making advanced VAD capabilities deployable on resource-constrained edge devices.
format Preprint
id arxiv_https___arxiv_org_abs_2604_03040
institution arXiv
publishDate 2026
record_format arxiv
title QVAD: A Question-Centric Agentic Framework for Efficient and Training-Free Video Anomaly Detection
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2604.03040