Bibliographic Details
Main Authors: Cai, Honghao, Wang, Xiangyuan, Li, Jing, Bai, Yunhao, Zhou, Tianze, Chen, Haohua, Hui, Chao, Qiao, Changhao, Wang, Runqi, Xu, Sijie, Hao, Yuyang, Cui, Zezhou, Yang, Yuyuan, Zhu, Wei, Chen, Yibo, Tang, Xu, Hu, Yao, Li, Zhen
Format: Preprint
Published: 2026
Subjects: Computer Vision and Pattern Recognition; Artificial Intelligence
Online Access: https://arxiv.org/abs/2603.00607
contents Multi-subject image generation requires seamlessly harmonizing multiple reference identities within a coherent scene. However, existing methods relying on rigid spatial masks or localized attention often struggle with the "stability-plasticity dilemma," particularly failing in tasks that require complex structural deformations, such as identity-preserving age transformation. To address this, we present IdGlow, a mask-free, progressive two-stage framework built upon Flow Matching diffusion models. In the supervised fine-tuning (SFT) stage, we introduce task-adaptive timestep scheduling aligned with diffusion generative dynamics: a linear decay schedule that progressively relaxes constraints for natural group composition, and a temporal gating mechanism that concentrates identity injection within a critical semantic window, successfully preserving adult facial semantics without overriding child-like anatomical structures. To resolve attribute leakage and semantic ambiguity without explicit layout inputs, we further integrate a badcase-driven Vision-Language Model (VLM) for precise, context-aware prompt synthesis. In the second stage, we design a Fine-Grained Group-Level Direct Preference Optimization (DPO) with a weighted margin formulation to simultaneously eliminate multi-subject artifacts, elevate texture harmony, and recalibrate identity fidelity towards real-world distributions. Extensive experiments on two challenging benchmarks -- direct multi-person fusion and age-transformed group generation -- demonstrate that IdGlow fundamentally mitigates the stability-plasticity conflict, achieving a superior Pareto balance between state-of-the-art facial fidelity and commercial-grade aesthetic quality.
format Preprint
id arxiv_https___arxiv_org_abs_2603_00607
institution arXiv
publishDate 2026
record_format arxiv
title IdGlow: Dynamic Identity Modulation for Multi-Subject Generation
topic Computer Vision and Pattern Recognition
Artificial Intelligence
url https://arxiv.org/abs/2603.00607