Achieving fine-grained controllability in human image synthesis is a long-standing challenge in computer vision. Existing methods primarily focus on either facial synthesis or near-frontal body generation, with limited ability to simultaneously control key factors such as viewpoint, pose, clothing, and identity in a disentangled manner. In this paper, we introduce a new disentangled and controllable human synthesis task, which explicitly separates and manipulates these four factors within a unified framework. We first develop an end-to-end generative model trained on MVHumanNet for factor disentanglement. However, the domain gap between MVHumanNet and in-the-wild data produces unsatisfactory results, motivating the exploration of a virtual try-on (VTON) dataset as a potential solution. Through experiments, we observe that simply incorporating the VTON dataset as additional training data for the end-to-end model degrades performance, primarily due to the inconsistency in data forms between the two datasets, which disrupts the disentanglement process. To better leverage both datasets, we propose a stage-by-stage framework that decomposes human image generation into three sequential steps: clothed A-pose generation, back-view synthesis, and pose and view control. This structured pipeline enables better dataset utilization at different stages, significantly improving controllability and generalization, especially in in-the-wild scenarios. Extensive experiments demonstrate that our stage-by-stage approach outperforms end-to-end models in both visual fidelity and disentanglement quality, offering a scalable solution for real-world tasks.
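To make the stage-by-stage decomposition concrete, the following is a minimal Python sketch of how the three stages could be chained. The function names, signatures, and placeholder return values are hypothetical illustrations for exposition only, not the released code of this work.

# Hedged sketch of the stage-by-stage pipeline. All names below are
# hypothetical; each stage stands in for a generative model call.

from dataclasses import dataclass

@dataclass
class Conditions:
    face: str       # identity reference (e.g. a cropped face image path)
    clothing: str   # clothing reference image path
    pose_map: str   # target pose map path
    view: str       # target viewpoint descriptor

def stage1_front_view(face, clothing):
    """Stage 1: synthesize a clothed, A-pose, front-view image
    from the identity (face) and clothing references."""
    return f"front_view({face}, {clothing})"

def stage2_back_view(front_view_image):
    """Stage 2: synthesize the corresponding back view,
    completing the subject's appearance."""
    return f"back_view({front_view_image})"

def stage3_free_view(front_view_image, back_view_image, pose_map, view):
    """Stage 3: re-render the subject under the target pose
    and viewpoint, conditioned on both the front and back views."""
    return f"free_view({front_view_image}, {back_view_image}, {pose_map}, {view})"

def synthesize(cond: Conditions) -> str:
    front = stage1_front_view(cond.face, cond.clothing)
    back = stage2_back_view(front)
    return stage3_free_view(front, back, cond.pose_map, cond.view)

if __name__ == "__main__":
    print(synthesize(Conditions("face.png", "shirt.png", "pose.png", "side-45deg")))

Because each stage consumes only the outputs of earlier stages plus its own conditions, the datasets can be assigned per stage (e.g. VTON-style pairs for Stage 1, multi-view data for Stages 2 and 3) without mixing incompatible data forms in a single training run.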
Overview of the proposed pipelines. (a) The end-to-end pipeline directly synthesizes the final image from disentangled inputs, including a face image, clothing images, and a pose map. (b) The stage-by-stage pipeline decomposes the process into three steps: front-view synthesis with identity and clothing control, back-view synthesis, and free-view synthesis under the target pose and viewpoint. Both pipelines are implemented using DiscoHuman, with details provided below.
The DiscoHuman model \( \varepsilon \) consists of a VisualDiT \( \varepsilon_V \) and a HumanDiT \( \varepsilon_H \). The VisualDiT is responsible for encoding visual conditions, with different input settings depending on the pipeline or stage in which DiscoHuman is applied. The upper-left blocks illustrate three possible input configurations. In this figure, the active configuration corresponds to Stage 3, while the inactive settings are indicated by grey dashed lines. For simplicity, the denoising timestep \( t \) is not shown in this figure.
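Below is a minimal, hedged PyTorch sketch of the dual-DiT structure described above: a VisualDiT-style encoder turns the visual conditions into tokens, and a HumanDiT-style denoiser attends to them while denoising the image tokens. The dimensions, layer counts, class interfaces, and cross-attention conditioning are illustrative assumptions, not the exact DiscoHuman architecture.

# Hedged sketch of the VisualDiT + HumanDiT interaction. The modules,
# hyperparameters, and conditioning mechanism here are assumptions
# made for illustration, not the paper's implementation.

import torch
import torch.nn as nn

class VisualDiT(nn.Module):
    """Encodes the visual-condition tokens (e.g. face, clothing, pose)
    into a shared condition sequence."""
    def __init__(self, dim=256, depth=2, heads=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)

    def forward(self, cond_tokens):                 # (B, N_cond, dim)
        return self.encoder(cond_tokens)

class HumanDiT(nn.Module):
    """Denoises image tokens while cross-attending to the condition tokens."""
    def __init__(self, dim=256, depth=2, heads=4):
        super().__init__()
        layer = nn.TransformerDecoderLayer(dim, heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, depth)
        self.time_embed = nn.Linear(1, dim)         # simple timestep embedding

    def forward(self, noisy_tokens, cond_tokens, t):  # t: (B, 1)
        h = noisy_tokens + self.time_embed(t).unsqueeze(1)
        return self.decoder(h, cond_tokens)

if __name__ == "__main__":
    B, N_img, N_cond, dim = 2, 64, 32, 256
    visual_dit, human_dit = VisualDiT(dim), HumanDiT(dim)
    cond = visual_dit(torch.randn(B, N_cond, dim))   # encoded visual conditions
    noisy = torch.randn(B, N_img, dim)               # noisy image latents
    t = torch.rand(B, 1)                             # denoising timestep
    print(human_dit(noisy, cond, t).shape)           # torch.Size([2, 64, 256])

Under this reading, switching between the input configurations shown in the figure amounts to feeding a different set of condition tokens into the VisualDiT encoder, while the HumanDiT denoiser is left unchanged.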
Comparison on MVHumanNet (1).
Comparison on MVHumanNet (2).
Comparison on the THuman 4.0 and AvatarRex datasets.
Comparison on in-the-wild data generated by Stable Diffusion.
We compare our method against AA (AnimateAnyone), MA (MagicAnimate), and CFLD. Our method achieves the best performance across multiple datasets.
@article{sun2025exploring,
title={Exploring Disentangled and Controllable Human Image Synthesis: From End-to-End to Stage-by-Stage},
author={Sun, Zhengwentai and Li, Heyuan and Yang, Xihe and Zheng, Keru and Ning, Shuliang and Zhi, Yihao and Liao, Hongjie and Li, Chenghong and Cui, Shuguang and Han, Xiaoguang},
journal={arXiv preprint arXiv:2503.19486},
year={2025}
}