Bibliographic Details
Main Authors: Fang, Zhihua, Tao, Shumei, Wang, Junxu, He, Liang
Format: Preprint
Published: 2025
Subjects: Sound; Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2512.06757
author Fang, Zhihua
Tao, Shumei
Wang, Junxu
He, Liang
contents This paper introduces our solution, XM-ALIGN (Unified Cross-Modal Embedding Alignment Framework), proposed for the FAME challenge at ICASSP 2026. Our framework combines explicit and implicit alignment mechanisms, significantly improving cross-modal verification performance in both "heard" and "unheard" languages. By extracting feature embeddings from both face and voice encoders and jointly optimizing them using a shared classifier, we employ mean squared error (MSE) as the embedding alignment loss to ensure tight alignment between modalities. Additionally, data augmentation strategies are applied during model training to enhance generalization. Experimental results show that our approach demonstrates superior performance on the MAV-Celeb dataset. The code will be released at https://github.com/PunkMale/XM-ALIGN.
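The objective described in the abstract (a classifier shared by both modalities plus an MSE term between the face and voice embeddings) can be sketched roughly as follows. This is a minimal illustration in plain Python, not the authors' implementation: the use of softmax cross-entropy for the classifier, the helper names, and the `align_weight` parameter are all assumptions that the record does not confirm.

```python
import math


def linear(weights, bias, x):
    """Apply one shared linear classifier: one logit per identity class."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def softmax_cross_entropy(logits, label):
    """Numerically stable cross-entropy for a single example."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def mse(a, b):
    """Mean squared error between two embeddings of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def joint_loss(face_emb, voice_emb, label, weights, bias, align_weight=1.0):
    """Hypothetical combined objective (assumed form, not from the paper).

    Both embeddings pass through the SAME classifier, which pushes them
    toward a common identity space; the MSE term pulls the paired face
    and voice embeddings together directly.
    """
    face_ce = softmax_cross_entropy(linear(weights, bias, face_emb), label)
    voice_ce = softmax_cross_entropy(linear(weights, bias, voice_emb), label)
    return face_ce + voice_ce + align_weight * mse(face_emb, voice_emb)
```

In a real system the embeddings would come from trained face and voice encoders and the loss would be minimized with an autodiff framework; the sketch only shows how the two terms might be combined for one face-voice pair.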
format Preprint
id arxiv_https___arxiv_org_abs_2512_06757
institution arXiv
publishDate 2025
record_format arxiv
title XM-ALIGN: Unified Cross-Modal Embedding Alignment for Face-Voice Association
topic Sound
Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2512.06757