Staff View: Exploring Disentangled and Controllable Human Image Synthesis: From End-to-End to Stage-by-Stage :: Library Catalog

Bibliographic Details
Main Authors:	Sun, Zhengwentai; Li, Chenghong; Liao, Hongjie; Yang, Xihe; Zheng, Keru; Li, Heyuan; Zhi, Yihao; Ning, Shuliang; Cui, Shuguang; Han, Xiaoguang
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2503.19486

_version_	1866908902423003136
author	Sun, Zhengwentai; Li, Chenghong; Liao, Hongjie; Yang, Xihe; Zheng, Keru; Li, Heyuan; Zhi, Yihao; Ning, Shuliang; Cui, Shuguang; Han, Xiaoguang
author_facet	Sun, Zhengwentai; Li, Chenghong; Liao, Hongjie; Yang, Xihe; Zheng, Keru; Li, Heyuan; Zhi, Yihao; Ning, Shuliang; Cui, Shuguang; Han, Xiaoguang
contents	Achieving fine-grained controllability in human image synthesis is a long-standing challenge in computer vision. Existing methods primarily focus on either facial synthesis or near-frontal body generation, with limited ability to simultaneously control key factors such as viewpoint, pose, clothing, and identity in a disentangled manner. In this paper, we introduce a new disentangled and controllable human synthesis task, which explicitly separates and manipulates these four factors within a unified framework. We first develop an end-to-end generative model trained on MVHumanNet for factor disentanglement. However, the domain gap between MVHumanNet and in-the-wild data produces unsatisfactory results, motivating the exploration of a virtual try-on (VTON) dataset as a potential solution. Through experiments, we observe that simply incorporating the VTON dataset as additional data to train the end-to-end model degrades performance, primarily due to the inconsistency in data forms between the two datasets, which disrupts the disentanglement process. To better leverage both datasets, we propose a stage-by-stage framework that decomposes human image generation into three sequential steps: clothed A-pose generation, back-view synthesis, and pose and view control. This structured pipeline enables better dataset utilization at different stages, significantly improving controllability and generalization, especially for in-the-wild scenarios. Extensive experiments demonstrate that our stage-by-stage approach outperforms end-to-end models in both visual fidelity and disentanglement quality, offering a scalable solution for real-world tasks. Additional demos are available on the project page: https://taited.github.io/discohuman-project/.
format	Preprint
id	arxiv_https___arxiv_org_abs_2503_19486
institution	arXiv
publishDate	2025
record_format	arxiv
title	Exploring Disentangled and Controllable Human Image Synthesis: From End-to-End to Stage-by-Stage
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2503.19486
