Bibliographic Details
Main Authors: Xu, Binxiao; Feng, Junyu; Lin, Xiaopeng; Li, Haodong; Feng, Zhiyuan; Zeng, Bohan; Lu, Shaolin; Lu, Ming; She, Qi; Zhang, Wentao
Format: Preprint
Published: 2026
Subjects: Computer Vision and Pattern Recognition; Artificial Intelligence
Online Access: https://arxiv.org/abs/2602.07625
_version_ 1866918327077568512
author Xu, Binxiao
Feng, Junyu
Lin, Xiaopeng
Li, Haodong
Feng, Zhiyuan
Zeng, Bohan
Lu, Shaolin
Lu, Ming
She, Qi
Zhang, Wentao
contents Multimodal understanding of advertising videos is essential for interpreting the intricate relationship between visual storytelling and abstract persuasion strategies. However, despite excelling at general search, existing agents often struggle to bridge the cognitive gap between pixel-level perception and high-level marketing logic. To address this challenge, we introduce AD-MIR, a framework designed to decode advertising intent via a two-stage architecture. First, in the Structure-Aware Memory Construction phase, the system converts raw video into a structured database by integrating semantic retrieval with exact keyword matching. This approach prioritizes fine-grained brand details (e.g., logos, on-screen text) while dynamically filtering out irrelevant background noise to isolate key protagonists. Second, the Structured Reasoning Agent mimics a marketing expert through an iterative inquiry loop, decomposing the narrative to deduce implicit persuasion tactics. Crucially, it employs an evidence-based self-correction mechanism that rigorously validates these insights against specific video frames, automatically backtracking when visual support is lacking. Evaluation on the AdsQA benchmark demonstrates that AD-MIR achieves state-of-the-art performance, surpassing the strongest general-purpose agent, DVD, by 1.8% in strict and 9.5% in relaxed accuracy. These results underscore that effective advertising understanding demands explicitly grounding abstract marketing strategies in pixel-level evidence. The code is available at https://github.com/Little-Fridge/AD-MIR.
format Preprint
id arxiv_https___arxiv_org_abs_2602_07625
institution arXiv
publishDate 2026
record_format arxiv
title AD-MIR: Bridging the Gap from Perception to Persuasion in Advertising Video Understanding via Structured Reasoning
topic Computer Vision and Pattern Recognition
Artificial Intelligence
url https://arxiv.org/abs/2602.07625
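Illustrative note on the abstract: the Structure-Aware Memory Construction phase is described as "integrating semantic retrieval with exact keyword matching" so that brand details such as logos and on-screen text are prioritized. The sketch below shows one possible way such a hybrid score could be combined. It is not taken from the AD-MIR code. The entry fields (`segment`, `caption`, `ocr_text`), the function `hybrid_search`, and the `keyword_weight` parameter are all hypothetical, and a bag-of-words cosine similarity stands in for whatever learned embedding model the authors actually use.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace after dropping simple punctuation."""
    for ch in ",.!?":
        text = text.replace(ch, " ")
    return text.lower().split()


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hybrid_search(query, keywords, entries, keyword_weight=1.0):
    """Rank memory entries by semantic similarity plus an exact-keyword bonus.

    Assumption: each entry is a dict with hypothetical fields
    'segment', 'caption', and 'ocr_text'.
    """
    query_vec = Counter(tokenize(query))
    scored = []
    for entry in entries:
        tokens = tokenize(entry["caption"] + " " + entry["ocr_text"])
        semantic = cosine(query_vec, Counter(tokens))  # stand-in for embeddings
        token_set = set(tokens)
        exact = sum(1 for k in keywords if k.lower() in token_set)
        scored.append((semantic + keyword_weight * exact, entry["segment"]))
    # Highest score first; ties broken by segment index for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored


# Toy memory: invented captions and on-screen text for three video segments.
MEMORY = [
    {"segment": 0, "caption": "a runner ties her shoes at dawn", "ocr_text": ""},
    {"segment": 1, "caption": "close-up of a shoe on wet pavement",
     "ocr_text": "ACME Run Further"},
    {"segment": 2, "caption": "city skyline in the background", "ocr_text": ""},
]

results = hybrid_search("shoe brand logo", ["ACME"], MEMORY)
top_segment = results[0][1]  # the segment whose on-screen text contains "ACME"
```

In this toy setup the exact-match bonus dominates the ranking, which mirrors the abstract's emphasis on fine-grained brand details. A real system would need to tune the balance between the two signals.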