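| Main Authors: | Cai, Honghao; Wang, Xiangyuan; Li, Jing; Bai, Yunhao; Zhou, Tianze; Chen, Haohua; Hui, Chao; Qiao, Changhao; Wang, Runqi; Xu, Sijie; Hao, Yuyang; Cui, Zezhou; Yang, Yuyuan; Zhu, Wei; Chen, Yibo; Tang, Xu; Hu, Yao; Li, Zhen |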
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | Computer Vision and Pattern Recognition; Artificial Intelligence |
| Online Access: | https://arxiv.org/abs/2603.00607 |
| _version_ | 1866914573759545344 |
|---|---|
| author | Cai, Honghao; Wang, Xiangyuan; Li, Jing; Bai, Yunhao; Zhou, Tianze; Chen, Haohua; Hui, Chao; Qiao, Changhao; Wang, Runqi; Xu, Sijie; Hao, Yuyang; Cui, Zezhou; Yang, Yuyuan; Zhu, Wei; Chen, Yibo; Tang, Xu; Hu, Yao; Li, Zhen |
| author_facet | Cai, Honghao; Wang, Xiangyuan; Li, Jing; Bai, Yunhao; Zhou, Tianze; Chen, Haohua; Hui, Chao; Qiao, Changhao; Wang, Runqi; Xu, Sijie; Hao, Yuyang; Cui, Zezhou; Yang, Yuyuan; Zhu, Wei; Chen, Yibo; Tang, Xu; Hu, Yao; Li, Zhen |
| contents | Multi-subject image generation requires seamlessly harmonizing multiple reference identities within a coherent scene. However, existing methods relying on rigid spatial masks or localized attention often struggle with the "stability-plasticity dilemma," particularly failing in tasks that require complex structural deformations, such as identity-preserving age transformation. To address this, we present IdGlow, a mask-free, progressive two-stage framework built upon Flow Matching diffusion models. In the supervised fine-tuning (SFT) stage, we introduce task-adaptive timestep scheduling aligned with diffusion generative dynamics: a linear decay schedule that progressively relaxes constraints for natural group composition, and a temporal gating mechanism that concentrates identity injection within a critical semantic window, successfully preserving adult facial semantics without overriding child-like anatomical structures. To resolve attribute leakage and semantic ambiguity without explicit layout inputs, we further integrate a badcase-driven Vision-Language Model (VLM) for precise, context-aware prompt synthesis. In the second stage, we design a Fine-Grained Group-Level Direct Preference Optimization (DPO) with a weighted margin formulation to simultaneously eliminate multi-subject artifacts, elevate texture harmony, and recalibrate identity fidelity towards real-world distributions. Extensive experiments on two challenging benchmarks -- direct multi-person fusion and age-transformed group generation -- demonstrate that IdGlow fundamentally mitigates the stability-plasticity conflict, achieving a superior Pareto balance between state-of-the-art facial fidelity and commercial-grade aesthetic quality. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2603_00607 |
| institution | arXiv |
| publishDate | 2026 |
| record_format | arxiv |
| title | IdGlow: Dynamic Identity Modulation for Multi-Subject Generation |
| topic | Computer Vision and Pattern Recognition; Artificial Intelligence |
| url | https://arxiv.org/abs/2603.00607 |