Staff View: Improving vision-language alignment with graph spiking hybrid Networks :: Library Catalog

Bibliographic Details
Main Authors:	Zhang, Siyu; Liu, Wenzhe; Chen, Yeming; Wu, Yiming; Zheng, Heming; Cheng, Cheng
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2501.19069

_version_	1866909519466987520
author	Zhang, Siyu; Liu, Wenzhe; Chen, Yeming; Wu, Yiming; Zheng, Heming; Cheng, Cheng
author_facet	Zhang, Siyu; Liu, Wenzhe; Chen, Yeming; Wu, Yiming; Zheng, Heming; Cheng, Cheng
contents	To bridge the semantic gap between vision and language (VL), it is necessary to develop a good alignment strategy, which includes handling semantic diversity, abstract representation of visual information, and generalization ability of models. Recent works use detector-based bounding boxes or patches with regular partitions to represent visual semantics. While current paradigms have made strides, they are still insufficient for fully capturing the nuanced contextual relations among various objects. This paper proposes a comprehensive visual semantic representation module, necessitating the utilization of panoptic segmentation to generate coherent fine-grained semantic features. Furthermore, we propose a novel Graph Spiking Hybrid Network (GSHN) that integrates the complementary advantages of Spiking Neural Networks (SNNs) and Graph Attention Networks (GATs) to encode visual semantic information. Intriguingly, the model not only encodes the discrete and continuous latent variables of instances but also adeptly captures both local and global contextual features, thereby significantly enhancing the richness and diversity of semantic representations. Leveraging the spatiotemporal properties inherent in SNNs, we employ contrastive learning (CL) to enhance the similarity-based representation of embeddings. This strategy alleviates the computational overhead of the model and enriches meaningful visual representations by constructing positive and negative sample pairs. We design an innovative pre-training method, Spiked Text Learning (STL), which uses text features to improve the encoding ability of discrete semantics. Experiments show that the proposed GSHN exhibits promising results on multiple VL downstream tasks.
format	Preprint
id	arxiv_https___arxiv_org_abs_2501_19069
institution	arXiv
publishDate	2025
record_format	arxiv
title	Improving vision-language alignment with graph spiking hybrid Networks
topic	Computer Vision and Pattern Recognition; Artificial Intelligence
url	https://arxiv.org/abs/2501.19069

Similar Items