Staff View: MM-Diff: High-Fidelity Image Personalization via Multi-Modal Condition Integration :: Library Catalog

Bibliographic Details
Main Authors:	Wei, Zhichao; Su, Qingkun; Qin, Long; Wang, Weizhi
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2403.15059

_version_	1866914724759732224
author	Wei, Zhichao; Su, Qingkun; Qin, Long; Wang, Weizhi
author_facet	Wei, Zhichao; Su, Qingkun; Qin, Long; Wang, Weizhi
contents	Recent advances in tuning-free personalized image generation based on diffusion models are impressive. However, to improve subject fidelity, existing methods either retrain the diffusion model or infuse it with dense visual embeddings, both of which suffer from poor generalization and efficiency. Also, these methods falter in multi-subject image generation due to the unconstrained cross-attention mechanism. In this paper, we propose MM-Diff, a unified and tuning-free image personalization framework capable of generating high-fidelity images of both single and multiple subjects in seconds. Specifically, to simultaneously enhance text consistency and subject fidelity, MM-Diff employs a vision encoder to transform the input image into CLS and patch embeddings. CLS embeddings are used on the one hand to augment the text embeddings, and on the other hand together with patch embeddings to derive a small number of detail-rich subject embeddings, both of which are efficiently integrated into the diffusion model through the well-designed multimodal cross-attention mechanism. Additionally, MM-Diff introduces cross-attention map constraints during the training phase, ensuring flexible multi-subject image sampling during inference without any predefined inputs (e.g., layout). Extensive experiments demonstrate the superior performance of MM-Diff over other leading methods.
format	Preprint
id	arxiv_https___arxiv_org_abs_2403_15059
institution	arXiv
publishDate	2024
record_format	arxiv
title	MM-Diff: High-Fidelity Image Personalization via Multi-Modal Condition Integration
topic	Computer Vision and Pattern Recognition; Artificial Intelligence
url	https://arxiv.org/abs/2403.15059