Bibliographic Details
Main Authors: Pandey, Ananya, Vishwakarma, Dinesh Kumar
Format: Preprint
Published: 2024
Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence
Online Access: https://arxiv.org/abs/2408.02571
author Pandey, Ananya
Vishwakarma, Dinesh Kumar
contents Emoticons are symbolic representations that generally accompany textual content to visually enhance or summarize the true intention of a written message. Although they are widely used on social media, their core semantics have not been extensively explored across multiple modalities. Combining textual and visual information within a single message offers a richer way of conveying information. Hence, this research analyzes the relationship among sentences, visuals, and emoticons. The paper first examines in detail the various techniques for extracting multimodal features, emphasizing the pros and cons of each method. After a comprehensive examination of several multimodal algorithms, with specific emphasis on fusion approaches, we propose a novel contrastive learning-based multimodal architecture. The proposed model jointly trains a dual-branch encoder with contrastive learning to accurately map text and images into a common latent space. Our key finding is that integrating the principle of contrastive learning with that of the other two branches yields superior results. The experimental results demonstrate that the proposed methodology surpasses existing multimodal approaches in terms of accuracy and robustness. The proposed model attained an accuracy of 91% and an MCC score of 90% when assessing emoticons on the Multimodal-Twitter Emoticon dataset acquired from Twitter. We provide evidence that deep features acquired by contrastive learning are more efficient, suggesting that the proposed fusion technique also possesses strong generalisation capabilities for recognising emoticons across several modes.
format Preprint
id arxiv_https___arxiv_org_abs_2408_02571
institution arXiv
publishDate 2024
record_format arxiv
title Contrastive Learning-based Multi Modal Architecture for Emoticon Prediction by Employing Image-Text Pairs
topic Computer Vision and Pattern Recognition
Artificial Intelligence
url https://arxiv.org/abs/2408.02571