Bibliographic Details
Main Authors: Chu, Hailong; Li, Hongbing; Chu, Yunlong; Huang, Shutai; Zhang, Xingyue; Yan, Tinghe; Zhang, Jinsong; Zhang, Shuo; Li, Lei
Format: Preprint
Published: 2026
Online Access: https://arxiv.org/abs/2603.06683
Abstract:
  • Multimedia event extraction (M2E2) aims to predict triggers, ground arguments across text and images, and then assemble them into schema-consistent event records. Recent LLM-based approaches have shown strong potential for M2E2, but their intermediate event hypotheses often remain implicit, and event-argument linking is still tightly coupled with role binding. This leaves little opportunity to inspect or revise intermediate event hypotheses and makes predictions brittle to early errors. To bridge this gap, we present ECHO, a multi-agent framework that reframes M2E2 as iterative refinement over an explicit Multimedia Event Hypergraph (MEHG). Instead of relying on implicit linear generation, ECHO performs auditable atomic updates over a shared hypergraph, making intermediate event structures explicit and revisable. Furthermore, we introduce a Link-then-Bind strategy that decouples event-argument linking from role binding, reducing premature semantic commitment during structured prediction. Extensive experiments on the M2E2 benchmark show that ECHO consistently outperforms prior state-of-the-art approaches, achieving gains of 7.3 and 15.5 F1 points on event mention and argument role, respectively.
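
The abstract names two ideas, auditable atomic updates over a shared hypergraph and a Link-then-Bind split between event-argument linking and role binding, without giving implementation details. The Python sketch below is one possible reading of those ideas, not the authors' code. Every class, method, and field name here (`EventHypergraph`, `Hyperedge`, `link`, `bind`, `unlink`, `log`) is hypothetical, and the event type and role labels in the usage lines are illustrative placeholders only.

```python
from dataclasses import dataclass, field


@dataclass
class Hyperedge:
    """One event hypothesis: a trigger plus its linked arguments."""
    event_type: str
    trigger: str
    # argument node id -> role name, or None while linked but not yet bound
    arguments: dict = field(default_factory=dict)


@dataclass
class EventHypergraph:
    """Hypothetical shared state; every atomic update is appended to a log."""
    nodes: dict = field(default_factory=dict)  # node id -> (modality, content)
    edges: dict = field(default_factory=dict)  # event id -> Hyperedge
    log: list = field(default_factory=list)    # applied updates, in order

    def add_node(self, node_id, modality, content):
        self.nodes[node_id] = (modality, content)
        self.log.append(("add_node", node_id))

    def add_event(self, event_id, event_type, trigger):
        self.edges[event_id] = Hyperedge(event_type, trigger)
        self.log.append(("add_event", event_id))

    def link(self, event_id, node_id):
        # Step 1 (link): attach the argument without committing to a role.
        if node_id not in self.nodes:
            raise KeyError(node_id)
        self.edges[event_id].arguments.setdefault(node_id, None)
        self.log.append(("link", event_id, node_id))

    def bind(self, event_id, node_id, role):
        # Step 2 (bind): assign a role only to an already linked argument.
        args = self.edges[event_id].arguments
        if node_id not in args:
            raise ValueError("bind requires a prior link")
        args[node_id] = role
        self.log.append(("bind", event_id, node_id, role))

    def unlink(self, event_id, node_id):
        # Revision: drop a linked argument together with its role, if any.
        self.edges[event_id].arguments.pop(node_id, None)
        self.log.append(("unlink", event_id, node_id))


# Usage with made-up content: link a text and an image argument,
# bind a role to one of them, then revise the hypothesis.
g = EventHypergraph()
g.add_node("t1", "text", "protesters")
g.add_node("i1", "image", "bbox(10, 20, 200, 180)")
g.add_event("e1", "Conflict.Demonstrate", "marched")
g.link("e1", "t1")
g.link("e1", "i1")
g.bind("e1", "t1", "Entity")
g.unlink("e1", "i1")
```

In this sketch an argument can sit in a linked-but-unbound state (role `None`), which is one way to postpone the role decision, and the `log` list keeps each update inspectable after the fact. How ECHO's agents actually propose, verify, or roll back updates is not stated in this record.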