Staff View: SRA 2: Variational Autoencoder Self-Representation Alignment for Efficient Diffusion Training :: Library Catalog

Bibliographic Details
Main Authors:	Wang, Mengmeng; Jiang, Dengyang; Li, Liuzhuozheng; Lin, Yucheng; Shen, Guojiang; Kong, Xiangjie; Liu, Yong; Dai, Guang; Wang, Jingdong
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2601.17830

_version_	1866918375005880320
author	Wang, Mengmeng; Jiang, Dengyang; Li, Liuzhuozheng; Lin, Yucheng; Shen, Guojiang; Kong, Xiangjie; Liu, Yong; Dai, Guang; Wang, Jingdong
author_facet	Wang, Mengmeng; Jiang, Dengyang; Li, Liuzhuozheng; Lin, Yucheng; Shen, Guojiang; Kong, Xiangjie; Liu, Yong; Dai, Guang; Wang, Jingdong
contents	Denoising-based diffusion transformers, despite their strong generation performance, suffer from inefficient training convergence. Existing methods addressing this issue, such as REPA (relying on external representation encoders) or SRA (requiring dual-model setups), inevitably incur heavy computational overhead during training due to external dependencies. To tackle these challenges, this paper proposes SRA 2, a lightweight intrinsic guidance framework for efficient diffusion training. SRA 2 leverages off-the-shelf pre-trained Variational Autoencoder (VAE) features: their reconstruction property ensures inherent encoding of visual priors like rich texture details, structural patterns, and basic semantic information. Specifically, SRA 2 aligns the intermediate latent features of diffusion transformers with VAE features via a lightweight projection layer, supervised by a feature alignment loss. This design accelerates training without extra representation encoders or dual-model maintenance, resulting in a simple yet effective pipeline. Extensive experiments demonstrate that SRA 2 improves both generation quality and training convergence speed compared to vanilla diffusion transformers, matches or outperforms state-of-the-art acceleration methods, and incurs merely 4% extra GFLOPs with zero additional cost for external guidance models.
format	Preprint
id	arxiv_https___arxiv_org_abs_2601_17830
institution	arXiv
publishDate	2026
record_format	arxiv
title	SRA 2: Variational Autoencoder Self-Representation Alignment for Efficient Diffusion Training
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2601.17830
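
Illustrative note on the abstract: the alignment step it describes (intermediate diffusion-transformer features passed through a lightweight projection layer and compared against VAE features) can be sketched roughly as below. This is a minimal sketch and not the authors' implementation. The record does not state the form of the feature alignment loss, the projection design, or the loss weighting, so the cosine-based loss, the single linear projection, and the names `project`, `alignment_loss`, `total_loss`, and `align_weight` are all assumptions made for illustration.

```python
import math


def project(tokens, weight, bias):
    """Apply one linear layer to every token: out = W @ token + b.

    Assumption: a single linear map stands in for the "lightweight
    projection layer"; the record does not describe its architecture.
    """
    return [
        [sum(w * x for w, x in zip(row, token)) + b for row, b in zip(weight, bias)]
        for token in tokens
    ]


def cosine_similarity(u, v, eps=1e-8):
    """Cosine similarity of two vectors, guarded against zero norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (max(norm_u, eps) * max(norm_v, eps))


def alignment_loss(hidden_tokens, vae_tokens, weight, bias):
    """Mean (1 - cosine similarity) between projected hidden tokens and VAE tokens.

    Assumption: a cosine-based loss is used here only as a plausible example
    of a feature alignment loss; the record does not specify one.
    """
    projected = project(hidden_tokens, weight, bias)
    sims = [cosine_similarity(p, z) for p, z in zip(projected, vae_tokens)]
    return 1.0 - sum(sims) / len(sims)


def total_loss(denoising_loss, align_loss, align_weight=0.5):
    """Combine the usual denoising loss with the alignment term.

    `align_weight` is a hypothetical hyperparameter, not a value from the paper.
    """
    return denoising_loss + align_weight * align_loss


# Toy usage: two 2-dimensional tokens and an identity projection.
hidden = [[1.0, 0.0], [0.0, 2.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
zero_bias = [0.0, 0.0]
vae_same = [[1.0, 0.0], [0.0, 2.0]]
print(alignment_loss(hidden, vae_same, identity, zero_bias))
```

In this toy setup the loss is smallest when the projected features point in the same direction as the VAE features and grows as they diverge. Because the VAE is already part of a latent diffusion pipeline, a term of this kind needs no additional external encoder, which is consistent with the abstract's claim of zero additional cost for external guidance models.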

Similar Items