Bibliographic Details
Main Authors: Sun, Zelong, Wu, Jiahui, Ba, Ying, Jing, Dong, Lu, Zhiwu
Format: Preprint
Published: 2026
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2601.20511
_version_ 1866911404341067776
author Sun, Zelong
Wu, Jiahui
Ba, Ying
Jing, Dong
Lu, Zhiwu
author_facet Sun, Zelong
Wu, Jiahui
Ba, Ying
Jing, Dong
Lu, Zhiwu
contents As social media platforms proliferate, users increasingly demand intuitive ways to create diverse, high-quality portrait collections. In this work, we introduce Portrait Collection Generation (PCG), a novel task that generates coherent portrait collections by editing a reference portrait image through natural language instructions. This task poses two unique challenges to existing methods: (1) complex multi-attribute modifications such as pose, spatial layout, and camera viewpoint; and (2) high-fidelity detail preservation, including identity, clothing, and accessories. To address these challenges, we propose CHEESE, the first large-scale PCG dataset, containing 24K portrait collections and 573K samples with high-quality modification text annotations, constructed through a Large Vision-Language Model-based pipeline with inversion-based verification. We further propose SCheese, a framework that combines text-guided generation with hierarchical identity and detail preservation. SCheese employs an adaptive feature fusion mechanism to maintain identity consistency, and ConsistencyNet to inject fine-grained features for detail consistency. Comprehensive experiments validate the effectiveness of CHEESE in advancing PCG, with SCheese achieving state-of-the-art performance.
format Preprint
id arxiv_https___arxiv_org_abs_2601_20511
institution arXiv
publishDate 2026
record_format arxiv
spellingShingle Say Cheese! Detail-Preserving Portrait Collection Generation via Natural Language Edits
Sun, Zelong
Wu, Jiahui
Ba, Ying
Jing, Dong
Lu, Zhiwu
Computer Vision and Pattern Recognition
title Say Cheese! Detail-Preserving Portrait Collection Generation via Natural Language Edits
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2601.20511