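| Main Authors: | Wang, Lizhen; Xia, Zhurong; Hu, Tianshu; Wang, Pengrui; Wei, Pengfei; Zheng, Zerong; Zhou, Ming; Zhang, Yuan; Gao, Mingyuan |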
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | Computer Vision and Pattern Recognition; Artificial Intelligence |
| Online Access: | https://arxiv.org/abs/2506.10568 |
| _version_ | 1866914007960518656 |
|---|---|
| author | Wang, Lizhen; Xia, Zhurong; Hu, Tianshu; Wang, Pengrui; Wei, Pengfei; Zheng, Zerong; Zhou, Ming; Zhang, Yuan; Gao, Mingyuan |
| author_facet | Wang, Lizhen; Xia, Zhurong; Hu, Tianshu; Wang, Pengrui; Wei, Pengfei; Zheng, Zerong; Zhou, Ming; Zhang, Yuan; Gao, Mingyuan |
| contents | In e-commerce and digital marketing, generating high-fidelity human-product demonstration videos is important for effective product presentation. However, most existing frameworks either fail to preserve the identities of both humans and products or lack an understanding of human-product spatial relationships, leading to unrealistic representations and unnatural interactions. To address these challenges, we propose a Diffusion Transformer (DiT)-based framework. Our method simultaneously preserves human identities and product-specific details, such as logos and textures, by injecting paired human-product reference information and utilizing an additional masked cross-attention mechanism. We employ a 3D body mesh template and product bounding boxes to provide precise motion guidance, enabling intuitive alignment of hand gestures with product placements. Additionally, structured text encoding is used to incorporate category-level semantics, enhancing 3D consistency during small rotational changes across frames. Trained on a hybrid dataset with extensive data augmentation strategies, our approach outperforms state-of-the-art techniques in maintaining the identity integrity of both humans and products and generating realistic demonstration motions. Project page: https://lizhenwangt.github.io/DreamActor-H1/. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2506_10568 |
| institution | arXiv |
| publishDate | 2025 |
| record_format | arxiv |
| title | DreamActor-H1: High-Fidelity Human-Product Demonstration Video Generation via Motion-designed Diffusion Transformers |
| topic | Computer Vision and Pattern Recognition; Artificial Intelligence |
| url | https://arxiv.org/abs/2506.10568 |