Bibliographic Details
Main Authors: Wang, Mengmeng, Jiang, Dengyang, Li, Liuzhuozheng, Lin, Yucheng, Shen, Guojiang, Kong, Xiangjie, Liu, Yong, Dai, Guang, Wang, Jingdong
Format: Preprint
Published: 2026
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2601.17830
author Wang, Mengmeng
Jiang, Dengyang
Li, Liuzhuozheng
Lin, Yucheng
Shen, Guojiang
Kong, Xiangjie
Liu, Yong
Dai, Guang
Wang, Jingdong
contents Denoising-based diffusion transformers, despite their strong generation performance, suffer from inefficient training convergence. Existing methods addressing this issue, such as REPA (relying on external representation encoders) or SRA (requiring dual-model setups), inevitably incur heavy computational overhead during training due to external dependencies. To tackle these challenges, this paper proposes SRA 2, a lightweight intrinsic guidance framework for efficient diffusion training. SRA 2 leverages off-the-shelf pre-trained Variational Autoencoder (VAE) features: their reconstruction property ensures inherent encoding of visual priors like rich texture details, structural patterns, and basic semantic information. Specifically, SRA 2 aligns the intermediate latent features of diffusion transformers with VAE features via a lightweight projection layer, supervised by a feature alignment loss. This design accelerates training without extra representation encoders or dual-model maintenance, resulting in a simple yet effective pipeline. Extensive experiments demonstrate that SRA 2 improves both generation quality and training convergence speed compared to vanilla diffusion transformers, matches or outperforms state-of-the-art acceleration methods, and incurs merely 4% extra GFLOPs with zero additional cost for external guidance models.
format Preprint
id arxiv_https___arxiv_org_abs_2601_17830
institution arXiv
publishDate 2026
record_format arxiv
title SRA 2: Variational Autoencoder Self-Representation Alignment for Efficient Diffusion Training
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2601.17830
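
Illustrative sketch (not taken from the paper): the abstract describes aligning intermediate diffusion-transformer features with VAE features through a lightweight projection layer, supervised by a feature alignment loss. The record does not state the form of that loss, the projection architecture, or the loss weight, so the version below is only one plausible reading. It assumes a single linear projection, a per-token cosine-similarity loss, and a hypothetical weight `lam`. All function names are invented for illustration.

```python
import math


def project(tokens, weight, bias):
    """Apply one linear layer to every token (assumed form of the projection)."""
    return [
        [sum(w * x for w, x in zip(row, tok)) + b for row, b in zip(weight, bias)]
        for tok in tokens
    ]


def cosine(u, v, eps=1e-8):
    """Cosine similarity between two vectors, with eps to avoid division by zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)


def alignment_loss(hidden_tokens, vae_tokens, weight, bias):
    """Assumed alignment loss: 1 minus the mean per-token cosine similarity
    between projected transformer features and VAE features."""
    projected = project(hidden_tokens, weight, bias)
    sims = [cosine(p, z) for p, z in zip(projected, vae_tokens)]
    return 1.0 - sum(sims) / len(sims)


def total_loss(denoising_loss, align_loss, lam=0.5):
    """Combine the usual denoising objective with the alignment term.
    lam is a hypothetical weight, not a value from the paper."""
    return denoising_loss + lam * align_loss


# Toy usage: 3 tokens, hidden width 4, VAE feature width 2.
hidden = [[0.2, -0.1, 0.4, 0.3], [0.0, 0.5, -0.2, 0.1], [0.3, 0.3, 0.1, -0.4]]
vae = [[0.1, 0.6], [-0.3, 0.2], [0.5, -0.1]]
W = [[0.5, 0.0, 0.1, -0.2], [0.0, 0.3, -0.1, 0.4]]  # shape: 2 x 4
b = [0.0, 0.0]

loss = total_loss(denoising_loss=1.0, align_loss=alignment_loss(hidden, vae, W, b))
print(f"combined loss: {loss:.4f}")
```

In this reading, the only trainable addition is the projection (`W`, `b`); the VAE features come from the frozen encoder that a latent diffusion pipeline already uses, which is consistent with the abstract's claim of no extra guidance model.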