Bibliographic Details
Main Authors: Sun, Huiming; Li, Yikang; Yang, Kangning; Li, Ruineng; Xing, Daitao; Xie, Yangbo; Fu, Lan; Zhang, Kaiyu; Chen, Ming; Ding, Jiaming; Geng, Jiang; Cai, Jie; Meng, Zibo; Ho, Chiuman
Format: Preprint
Published: 2025
Online Access: https://arxiv.org/abs/2504.03041
Table of Contents:
  • Inpainting for real-world human and pedestrian removal in high-resolution video clips presents significant challenges, particularly in achieving high-quality outcomes, ensuring temporal consistency, and managing complex object interactions that involve humans, their belongings, and their shadows. In this paper, we introduce VIP (Video Inpainting Pipeline), a novel promptless video inpainting framework for real-world human removal applications. VIP enhances a state-of-the-art text-to-video model with a motion module and employs a Variational Autoencoder (VAE) for progressive denoising in the latent space. Additionally, we implement efficient human-and-belongings segmentation for precise mask generation. Extensive experiments demonstrate that VIP achieves superior temporal consistency and visual fidelity across diverse real-world scenarios, surpassing state-of-the-art methods on challenging datasets. Our key contributions are the VIP pipeline, a reference frame integration technique, and the Dual-Fusion Latent Segment Refinement method, all of which address the complexities of inpainting in long, high-resolution video sequences.