Bibliographic Details
Main Authors:	Liu, Mingchao; Sun, Yu; Sun, Ruixiao; Dong, Xin; Shen, Xiang; Wang, Hongwei; Xiong, Hongyu; Song, Yang
Format:	Preprint
Published:	2024
Subjects:	Computation and Language; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2412.15251

Abstract:

Multimodal large language models (MLLMs) are effective at capturing the semantics of short video content; however, they often fail to attend to the policy-specific details required for reliable content moderation. To address this limitation, we introduce IPS, a novel framework that integrates In-prompt Process Supervision into MLLMs by adding sequential reasoning over ancillary questions during fine-tuning. IPS consistently outperforms baseline MLLMs on public and proprietary benchmarks. Moreover, replacing human-annotated ancillary labels with MLLM-generated ones results in only marginal performance degradation, demonstrating robustness to noisy supervision and strong scalability with model-generated annotations. These findings establish IPS as a scalable and effective solution for complex multimodal classification in large-scale industrial settings.

Similar Items