Bibliographic Details
Main Authors: Ding, Zijun, Xiong, Mingdie, Zhu, Congcong, Chen, Jingrun
Format: Preprint
Published: 2025
Subjects: Computer Vision and Pattern Recognition; Artificial Intelligence
Online Access: https://arxiv.org/abs/2503.23039
author Ding, Zijun
Xiong, Mingdie
Zhu, Congcong
Chen, Jingrun
contents Existing audio-driven visual dubbing methods have achieved great success. Despite this, we observe that semantic ambiguity between the spatial and temporal domains significantly degrades synthesis stability for dynamic faces. We argue that aligning the semantic features of the spatial and temporal domains is a promising approach to stabilizing facial motion. To achieve this, we propose a Spatial-Temporal Semantic Alignment (STSA) method, which introduces a dual-path alignment mechanism and a differentiable semantic representation. The former leverages a Consistent Information Learning (CIL) module to maximize mutual information at multiple scales, thereby reducing the manifold differences between the spatial and temporal domains. The latter uses a probabilistic heatmap as ambiguity-tolerant guidance to avoid abnormal dynamics in the synthesized faces caused by slight semantic jittering. Extensive experimental results demonstrate the superiority of the proposed STSA, especially in terms of image quality and synthesis stability. Pre-trained weights and inference code are available at https://github.com/SCAILab-USTC/STSA.
format Preprint
id arxiv_https___arxiv_org_abs_2503_23039
institution arXiv
publishDate 2025
record_format arxiv
title STSA: Spatial-Temporal Semantic Alignment for Visual Dubbing
topic Computer Vision and Pattern Recognition
Artificial Intelligence
url https://arxiv.org/abs/2503.23039