Staff View: I2CR: Intra- and Inter-modal Collaborative Reflections for Multimodal Entity Linking :: Library Catalog

Bibliographic Details
Main Authors:	Liu, Ziyan; Li, Junwen; Li, Kaiwen; Ruan, Tong; Wang, Chao; He, Xinyan; Wang, Zongyu; Cao, Xuezhi; Liu, Jingping
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition; Information Retrieval
Online Access:	https://arxiv.org/abs/2508.02243

_version_	1866918113609515008
author	Liu, Ziyan; Li, Junwen; Li, Kaiwen; Ruan, Tong; Wang, Chao; He, Xinyan; Wang, Zongyu; Cao, Xuezhi; Liu, Jingping
author_facet	Liu, Ziyan; Li, Junwen; Li, Kaiwen; Ruan, Tong; Wang, Chao; He, Xinyan; Wang, Zongyu; Cao, Xuezhi; Liu, Jingping
contents	Multimodal entity linking plays a crucial role in a wide range of applications. Large language model-based methods have recently become the dominant paradigm for this task, leveraging both textual and visual modalities to improve performance. Despite their success, these methods still face two challenges: they incorporate image data even in scenarios where it is unnecessary, and they rely on a single, one-time extraction of visual features. Both can undermine effectiveness and accuracy. To address these challenges, we propose a novel LLM-based framework for multimodal entity linking, called Intra- and Inter-modal Collaborative Reflections (I2CR). The framework prioritizes text information. When intra- and inter-modality evaluations indicate that text alone is insufficient to link the correct entity, it employs a multi-round iterative strategy that integrates key visual clues from various aspects of the image to support reasoning and improve matching accuracy. Extensive experiments on three widely used public datasets demonstrate that our framework consistently outperforms current state-of-the-art methods, achieving improvements of 3.2%, 5.1%, and 1.6%, respectively. Our code is available at https://github.com/ziyan-xiaoyu/I2CR/.
format	Preprint
id	arxiv_https___arxiv_org_abs_2508_02243
institution	arXiv
publishDate	2025
record_format	arxiv
title	I2CR: Intra- and Inter-modal Collaborative Reflections for Multimodal Entity Linking
topic	Computer Vision and Pattern Recognition; Information Retrieval
url	https://arxiv.org/abs/2508.02243

Similar Items