Staff View: STSA: Spatial-Temporal Semantic Alignment for Visual Dubbing

Bibliographic Details
Main Authors:	Ding, Zijun; Xiong, Mingdie; Zhu, Congcong; Chen, Jingrun
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2503.23039

_version_	1866910898985107456
author	Ding, Zijun; Xiong, Mingdie; Zhu, Congcong; Chen, Jingrun
author_facet	Ding, Zijun; Xiong, Mingdie; Zhu, Congcong; Chen, Jingrun
contents	Existing audio-driven visual dubbing methods have achieved great success. Despite this, we observe that the semantic ambiguity between spatial and temporal domains significantly degrades the synthesis stability of dynamic faces. We argue that aligning the semantic features from the spatial and temporal domains is a promising approach to stabilizing facial motion. To achieve this, we propose a Spatial-Temporal Semantic Alignment (STSA) method, which introduces a dual-path alignment mechanism and a differentiable semantic representation. The former leverages a Consistent Information Learning (CIL) module to maximize the mutual information at multiple scales, thereby reducing the manifold differences between the spatial and temporal domains. The latter uses a probabilistic heatmap as ambiguity-tolerant guidance to avoid abnormal dynamics in the synthesized faces caused by slight semantic jittering. Extensive experimental results demonstrate the superiority of the proposed STSA, especially in terms of image quality and synthesis stability. Pre-trained weights and inference code are available at https://github.com/SCAILab-USTC/STSA.
format	Preprint
id	arxiv_https___arxiv_org_abs_2503_23039
institution	arXiv
publishDate	2025
record_format	arxiv
title	STSA: Spatial-Temporal Semantic Alignment for Visual Dubbing
topic	Computer Vision and Pattern Recognition; Artificial Intelligence
url	https://arxiv.org/abs/2503.23039
