Bibliographic details
Main Authors: Yu, Runyi; He, Tianyu; Zhang, Ailing; Wang, Yuchi; Guo, Junliang; Tan, Xu; Liu, Chang; Chen, Jie; Bian, Jiang
Format: Preprint
Published: 2024
Online access: https://arxiv.org/abs/2406.08096
Summary:
  • We aim to edit the lip movements in a talking video according to the given speech while preserving personal identity and visual details. The task can be decomposed into two sub-problems: (1) speech-driven lip motion generation and (2) visual appearance synthesis. Current solutions handle both sub-problems within a single generative model, which forces a difficult trade-off between lip-sync quality and preservation of visual details. Instead, we propose to disentangle motion and appearance and generate them one after the other, using a speech-to-motion diffusion model followed by a motion-conditioned appearance generation model. Challenges remain in each stage, such as motion-aware identity preservation in (1) and visual detail preservation in (2). To preserve personal identity, we adopt landmarks to represent the motion and further employ a landmark-based identity loss. To capture motion-agnostic visual details, we use separate encoders for the lip appearance, the non-lip appearance, and the motion, and then integrate them with a learned fusion module. We train MyTalk on a large-scale and diverse dataset. Experiments show that our method generalizes well to unseen and even out-of-domain persons in terms of both lip sync and visual detail preservation. We encourage readers to watch the videos on our project page (https://Ingrid789.github.io/MyTalk/).