Staff View: XM-ALIGN: Unified Cross-Modal Embedding Alignment for Face-Voice Association :: Library Catalog

Bibliographic Details
Main Authors:	Fang, Zhihua; Tao, Shumei; Wang, Junxu; He, Liang
Format:	Preprint
Published:	2025
Subjects:	Sound; Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2512.06757

_version_	1866913068589514752
author	Fang, Zhihua; Tao, Shumei; Wang, Junxu; He, Liang
author_facet	Fang, Zhihua; Tao, Shumei; Wang, Junxu; He, Liang
contents	This paper introduces our solution, XM-ALIGN (Unified Cross-Modal Embedding Alignment Framework), proposed for the FAME challenge at ICASSP 2026. Our framework combines explicit and implicit alignment mechanisms, significantly improving cross-modal verification performance in both "heard" and "unheard" languages. By extracting feature embeddings from both face and voice encoders and jointly optimizing them using a shared classifier, we employ mean squared error (MSE) as the embedding alignment loss to ensure tight alignment between modalities. Additionally, data augmentation strategies are applied during model training to enhance generalization. Experimental results show that our approach demonstrates superior performance on the MAV-Celeb dataset. The code will be released at https://github.com/PunkMale/XM-ALIGN.
format	Preprint
id	arxiv_https___arxiv_org_abs_2512_06757
institution	arXiv
publishDate	2025
record_format	arxiv
title	XM-ALIGN: Unified Cross-Modal Embedding Alignment for Face-Voice Association
topic	Sound; Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2512.06757
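
Illustrative note: the abstract in the contents field describes face and voice embeddings optimized jointly through a shared classifier, with mean squared error (MSE) as the embedding alignment loss. The following is a minimal sketch of how such an objective could be written, not the authors' implementation. The function names, the use of softmax cross-entropy for the classifier, the plain sum of the terms, and the align_weight parameter are all assumptions made for illustration; the record does not specify them.

```python
import math


def mse_alignment_loss(face_emb, voice_emb):
    """Mean squared error between two equal-length embedding vectors."""
    if len(face_emb) != len(voice_emb):
        raise ValueError("embeddings must have the same dimensionality")
    return sum((f - v) ** 2 for f, v in zip(face_emb, voice_emb)) / len(face_emb)


def shared_classifier_logits(emb, weights):
    """Apply one linear classifier (one weight row per identity) to an embedding."""
    return [sum(w * x for w, x in zip(row, emb)) for row in weights]


def cross_entropy(logits, label):
    """Softmax cross-entropy for a single example, computed in a stable way."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]


def joint_loss(face_emb, voice_emb, weights, label, align_weight=1.0):
    """Assumed objective: classification loss on both modalities through the
    same classifier weights, plus a weighted MSE alignment term."""
    ce_face = cross_entropy(shared_classifier_logits(face_emb, weights), label)
    ce_voice = cross_entropy(shared_classifier_logits(voice_emb, weights), label)
    align = mse_alignment_loss(face_emb, voice_emb)
    return ce_face + ce_voice + align_weight * align
```

In this sketch the alignment term is zero when the face and voice embeddings of one identity coincide, so only the two classification terms remain; raising align_weight pulls the two modalities closer at the expense of the classification terms.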
