Bibliographic Details
Main Authors: Wu, Ge, Zhang, Shen, Shi, Ruijing, Gao, Shanghua, Chen, Zhenyuan, Wang, Lei, Chen, Zhaowei, Gao, Hongcheng, Tang, Yao, Yang, Jian, Cheng, Ming-Ming, Li, Xiang
Format: Preprint
Published: 2025
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2507.01467
_version_ 1866909811932659712
author Wu, Ge
Zhang, Shen
Shi, Ruijing
Gao, Shanghua
Chen, Zhenyuan
Wang, Lei
Chen, Zhaowei
Gao, Hongcheng
Tang, Yao
Yang, Jian
Cheng, Ming-Ming
Li, Xiang
contents REPA and its variants effectively mitigate training challenges in diffusion models by incorporating external visual representations from pretrained models, aligning the noisy hidden projections of denoising networks with clean image representations from foundation models. We argue that this external alignment, which is absent during the entire denoising inference process, falls short of fully harnessing the potential of discriminative representations. In this work, we propose a straightforward method called Representation Entanglement for Generation (REG), which entangles low-level image latents with a single high-level class token from pretrained foundation models for denoising. REG acquires the capability to produce coherent image-class pairs directly from pure noise, substantially improving both generation quality and training efficiency. This is accomplished with negligible additional inference overhead, requiring only a single additional token for denoising (<0.5% increase in FLOPs and latency). The inference process concurrently reconstructs both image latents and their corresponding global semantics, where the acquired semantic knowledge actively guides and enhances the image generation process. On ImageNet 256×256, SiT-XL/2 + REG demonstrates remarkable convergence acceleration, achieving 63× and 23× faster training than SiT-XL/2 and SiT-XL/2 + REPA, respectively. More impressively, SiT-L/2 + REG trained for merely 400K iterations outperforms SiT-XL/2 + REPA trained for 4M iterations (10× longer). Code is available at: https://github.com/Martinser/REG.
format Preprint
id arxiv_https___arxiv_org_abs_2507_01467
institution arXiv
publishDate 2025
record_format arxiv
title Representation Entanglement for Generation: Training Diffusion Transformers Is Much Easier Than You Think
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2507.01467
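
The entanglement described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `entangle` and `add_noise`, the token count of 256 (assuming a 32×32 latent grid with patch size 2), the token width of 8, the assumption that the class token has already been projected to the same width as the image tokens, and the linear noising schedule are all illustrative assumptions. It shows only how a single class token can be joined to the image-latent sequence and noised together with it, and how small the added sequence length is.

```python
import numpy as np


def entangle(latent_tokens, class_token):
    """Prepend one class token to the image-latent token sequence.

    latent_tokens: array of shape (num_patches, dim)
    class_token:   array of shape (dim,), assumed already projected to dim
    """
    return np.concatenate([class_token[None, :], latent_tokens], axis=0)


def add_noise(x, t, rng):
    """Blend data with Gaussian noise: t=0 returns x, t=1 returns pure noise.

    The linear blend is an assumed schedule chosen for illustration.
    """
    noise = rng.standard_normal(x.shape)
    return (1.0 - t) * x + t * noise, noise


rng = np.random.default_rng(0)

# Hypothetical sizes: 256 image tokens of width 8.
num_patches, dim = 256, 8
latent_tokens = rng.standard_normal((num_patches, dim))
class_token = rng.standard_normal(dim)

# Image latents and the class token form one sequence and are noised jointly,
# so a denoiser would have to recover both from the same noisy input.
tokens = entangle(latent_tokens, class_token)
noisy_tokens, noise = add_noise(tokens, 0.5, rng)

# Relative growth in sequence length from the single extra token.
token_overhead = (tokens.shape[0] - num_patches) / num_patches
print(f"sequence length: {tokens.shape[0]}, extra tokens: {token_overhead:.2%}")
```

Under these assumed sizes the sequence grows from 256 to 257 tokens, an increase of roughly 0.4%, which is in line with the abstract's "<0.5%" overhead figure, although FLOPs and latency do not scale exactly with token count.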