Staff View: MILD: Multi-Layer Diffusion Strategy for Complex and Precise Multi-IP Aware Human Erasing :: Library Catalog

Bibliographic Details
Main Authors:	Yu, Jinghan; Xiao, Junhao; Ma, Zhiyuan; Ma, Yue; Liu, Kaiqi; Wang, Yuhan; Liu, Daizong; Meng, Xianghao; Li, Jianjun
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2508.06543

_version_	1866909901366755328
author	Yu, Jinghan; Xiao, Junhao; Ma, Zhiyuan; Ma, Yue; Liu, Kaiqi; Wang, Yuhan; Liu, Daizong; Meng, Xianghao; Li, Jianjun
author_facet	Yu, Jinghan; Xiao, Junhao; Ma, Zhiyuan; Ma, Yue; Liu, Kaiqi; Wang, Yuhan; Liu, Daizong; Meng, Xianghao; Li, Jianjun
contents	Recent years have witnessed the success of diffusion models in image customization tasks. However, existing mask-guided human erasing methods still struggle in complex scenarios such as human-human occlusion, human-object entanglement, and human-background interference, mainly due to the lack of large-scale multi-instance datasets and effective spatial decoupling to separate foreground from background. To bridge these gaps, we curate the MILD dataset capturing diverse poses, occlusions, and complex multi-instance interactions. We then define the Cross-Domain Attention Gap (CAG), an attention-gap metric to quantify semantic leakage. On top of these, we propose Multi-Layer Diffusion (MILD), which decomposes the generation process into independent denoising pathways, enabling separate reconstruction of each foreground instance and the background. To enhance human-centric understanding, we introduce Human Morphology Guidance, a plug-and-play module that incorporates pose, parsing, and spatial relationships into the diffusion process to improve structural awareness and restoration quality. Additionally, we present Spatially-Modulated Attention, an adaptive mechanism that leverages spatial mask priors to modulate attention across semantic regions, further widening the CAG to effectively minimize boundary artifacts and mitigate semantic leakage. Experiments show that MILD significantly outperforms existing methods. Datasets and code are publicly available at: https://mild-multi-layer-diffusion.github.io/.
format	Preprint
id	arxiv_https___arxiv_org_abs_2508_06543
institution	arXiv
publishDate	2025
record_format	arxiv
title	MILD: Multi-Layer Diffusion Strategy for Complex and Precise Multi-IP Aware Human Erasing
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2508.06543
