Bibliographic Details
Main Authors: Peng, Chong, He, Liqiang, Su, Dan
Format: Preprint
Published: 2024
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2404.09509
author Peng, Chong
He, Liqiang
Su, Dan
contents Considerable progress has been made in learning the association between voices and faces. However, most previous models rely on cosine similarity or L2 distance to measure how alike a voice and a face are after contrastive learning, and then apply that score to retrieval and matching tasks. This approach treats the embeddings only as high-dimensional vectors and uses just a small part of the available information. This paper introduces a novel framework for learning voice-face associations in an unsupervised setting. By employing a multimodal encoder after contrastive learning and casting the problem as binary classification, we can learn the implicit information within the embeddings more effectively and in more varied ways. Furthermore, by introducing an effective pair selection method, we improve the learning outcomes of both contrastive learning and the matching task. Empirical evidence demonstrates that our framework achieves state-of-the-art results in voice-face matching, verification, and retrieval, improving verification by approximately 3%, matching by about 2.5%, and retrieval by around 1.3%.
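The contrast drawn in the abstract, between scoring a voice-face pair by cosine similarity and scoring it with a fused binary classifier, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the record does not describe the paper's multimodal encoder, so a single ReLU layer over the concatenated embeddings stands in for it, and the function names, dimensions, and random weights are hypothetical.

```python
import math
import random


def cosine_score(voice, face):
    """Baseline score: cosine of the angle between two aligned embeddings."""
    dot = sum(v * f for v, f in zip(voice, face))
    norm_v = math.sqrt(sum(v * v for v in voice))
    norm_f = math.sqrt(sum(f * f for f in face))
    return dot / (norm_v * norm_f)


def fused_match_probability(voice, face, w_hidden, b_hidden, w_out, b_out):
    """Hypothetical fusion head (a stand-in, not the paper's encoder).

    Concatenates the two embeddings, applies one ReLU layer, and maps the
    result through a sigmoid to a match probability for binary classification.
    """
    fused = list(voice) + list(face)
    hidden = [
        max(0.0, sum(w * x for w, x in zip(row, fused)) + b)
        for row, b in zip(w_hidden, b_hidden)
    ]
    logit = sum(w * h for w, h in zip(w_out, hidden)) + b_out
    return 1.0 / (1.0 + math.exp(-logit))


# Toy inputs: random vectors in place of real voice and face embeddings.
rng = random.Random(0)
dim, hidden_dim = 8, 16
voice = [rng.gauss(0.0, 1.0) for _ in range(dim)]
face = [rng.gauss(0.0, 1.0) for _ in range(dim)]

# Untrained weights; in practice these would be learned from matched and
# mismatched pairs with a binary cross-entropy loss.
w_hidden = [[rng.gauss(0.0, 0.1) for _ in range(2 * dim)] for _ in range(hidden_dim)]
b_hidden = [0.0] * hidden_dim
w_out = [rng.gauss(0.0, 0.1) for _ in range(hidden_dim)]
b_out = 0.0

print("cosine score:", cosine_score(voice, face))
print("fused match probability:",
      fused_match_probability(voice, face, w_hidden, b_hidden, w_out, b_out))
```

The cosine score depends only on the angle between the two vectors, whereas the fused head can, once trained, weight interactions between individual embedding dimensions. That is one way to read the abstract's claim that distance-based scoring uses only part of the available information.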
format Preprint
id arxiv_https___arxiv_org_abs_2404_09509
institution arXiv
publishDate 2024
record_format arxiv
title Fuse after Align: Improving Face-Voice Association Learning via Multimodal Encoder
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2404.09509