Staff View: Towards Training-free Multimodal Hate Localisation with Large Language Models :: Library Catalog

Bibliographic Details
Main Authors:	Sun, Yueming; Yang, Long; Jiao, Jianbo; Fu, Zeyu
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition; Multimedia
Online Access:	https://arxiv.org/abs/2602.09637

_version_	1866915788447809536
author	Sun, Yueming; Yang, Long; Jiao, Jianbo; Fu, Zeyu
author_facet	Sun, Yueming; Yang, Long; Jiao, Jianbo; Fu, Zeyu
contents	The proliferation of hateful content in online videos poses severe threats to individual well-being and societal harmony. However, existing solutions for video hate detection either rely heavily on large-scale human annotations or lack fine-grained temporal precision. In this work, we propose LELA, the first training-free Large Language Model (LLM) based framework for hate video localization. Distinct from state-of-the-art models that depend on supervised pipelines, LELA leverages LLMs and modality-specific captioning to detect and temporally localize hateful content in a training-free manner. Our method decomposes a video into five modalities, including image, speech, OCR, music, and video context, and uses a multi-stage prompting scheme to compute fine-grained hateful scores for each frame. We further introduce a composition matching mechanism to enhance cross-modal reasoning. Experiments on two challenging benchmarks, HateMM and MultiHateClip, demonstrate that LELA outperforms all existing training-free baselines by a large margin. We also provide extensive ablations and qualitative visualizations, establishing LELA as a strong foundation for scalable and interpretable hate video localization.
format	Preprint
id	arxiv_https___arxiv_org_abs_2602_09637
institution	arXiv
publishDate	2026
record_format	arxiv
title	Towards Training-free Multimodal Hate Localisation with Large Language Models
topic	Computer Vision and Pattern Recognition; Multimedia
url	https://arxiv.org/abs/2602.09637

Similar Items