Staff View: Towards Cross-modal Retrieval in Chinese Cultural Heritage Documents: Dataset and Solution :: Library Catalog

Bibliographic Details
Main Authors:	Yuan, Junyi; Zhang, Jian; Wu, Fangyu; Lu, Dongming; Lu, Huanda; Wang, Qiufeng
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2505.10921

_version_	1866912492195676160
author	Yuan, Junyi; Zhang, Jian; Wu, Fangyu; Lu, Dongming; Lu, Huanda; Wang, Qiufeng
author_facet	Yuan, Junyi; Zhang, Jian; Wu, Fangyu; Lu, Dongming; Lu, Huanda; Wang, Qiufeng
contents	China has a long and rich history, encompassing a vast cultural heritage that includes diverse multimodal information, such as silk patterns, Dunhuang murals, and their associated historical narratives. Cross-modal retrieval plays a pivotal role in understanding and interpreting Chinese cultural heritage by bridging visual and textual modalities to enable accurate text-to-image and image-to-text retrieval. However, despite the growing interest in multimodal research, there is a lack of specialized datasets dedicated to Chinese cultural heritage, limiting the development and evaluation of cross-modal learning models in this domain. To address this gap, we propose a multimodal dataset named CulTi, which contains 5,726 image-text pairs extracted from two series of professional documents, respectively related to ancient Chinese silk and Dunhuang murals. Compared to existing general-domain multimodal datasets, CulTi presents a challenge for cross-modal retrieval: the difficulty of local alignment between intricate decorative motifs and specialized textual descriptions. To address this challenge, we propose LACLIP, a training-free local alignment strategy built upon a fine-tuned Chinese-CLIP. LACLIP enhances the alignment of global textual descriptions with local visual regions by computing weighted similarity scores during inference. Experimental results on CulTi demonstrate that LACLIP significantly outperforms existing models in cross-modal retrieval, particularly in handling fine-grained semantic associations within Chinese cultural heritage.
format	Preprint
id	arxiv_https___arxiv_org_abs_2505_10921
institution	arXiv
publishDate	2025
record_format	arxiv
title	Towards Cross-modal Retrieval in Chinese Cultural Heritage Documents: Dataset and Solution
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2505.10921
