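| Main Authors: | Wang, Mengmeng; Jiang, Dengyang; Li, Liuzhuozheng; Lin, Yucheng; Shen, Guojiang; Kong, Xiangjie; Liu, Yong; Dai, Guang; Wang, Jingdong |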
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | Computer Vision and Pattern Recognition |
| Online Access: | https://arxiv.org/abs/2601.17830 |
| _version_ | 1866918375005880320 |
|---|---|
| author | Wang, Mengmeng; Jiang, Dengyang; Li, Liuzhuozheng; Lin, Yucheng; Shen, Guojiang; Kong, Xiangjie; Liu, Yong; Dai, Guang; Wang, Jingdong |
| author_facet | Wang, Mengmeng; Jiang, Dengyang; Li, Liuzhuozheng; Lin, Yucheng; Shen, Guojiang; Kong, Xiangjie; Liu, Yong; Dai, Guang; Wang, Jingdong |
| contents | Denoising-based diffusion transformers, despite their strong generation performance, suffer from inefficient training convergence. Existing methods addressing this issue, such as REPA (relying on external representation encoders) or SRA (requiring dual-model setups), inevitably incur heavy computational overhead during training due to external dependencies. To tackle these challenges, this paper proposes SRA 2, a lightweight intrinsic guidance framework for efficient diffusion training. SRA 2 leverages off-the-shelf pre-trained Variational Autoencoder (VAE) features: their reconstruction property ensures inherent encoding of visual priors like rich texture details, structural patterns, and basic semantic information. Specifically, SRA 2 aligns the intermediate latent features of diffusion transformers with VAE features via a lightweight projection layer, supervised by a feature alignment loss. This design accelerates training without extra representation encoders or dual-model maintenance, resulting in a simple yet effective pipeline. Extensive experiments demonstrate that SRA 2 improves both generation quality and training convergence speed compared to vanilla diffusion transformers, matches or outperforms state-of-the-art acceleration methods, and incurs merely 4% extra GFLOPs with zero additional cost for external guidance models. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2601_17830 |
| institution | arXiv |
| publishDate | 2026 |
| record_format | arxiv |
| title | SRA 2: Variational Autoencoder Self-Representation Alignment for Efficient Diffusion Training |
| topic | Computer Vision and Pattern Recognition |
| url | https://arxiv.org/abs/2601.17830 |
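
The abstract in the contents field describes aligning intermediate diffusion-transformer features with VAE features through a lightweight projection layer and a feature alignment loss. The following is a minimal sketch of that idea, not the authors' implementation. The record does not state the form of the loss, so the cosine-similarity objective, the single linear projection, the weighting factor `lam`, and all function names here are assumptions made for illustration.

```python
import math


def project(tokens, weight, bias):
    """Hypothetical lightweight projection: one linear map applied per token.

    tokens: list of hidden vectors (each of length d_model)
    weight: d_vae rows, each of length d_model
    bias:   list of length d_vae
    """
    return [
        [sum(w * x for w, x in zip(row, tok)) + b for row, b in zip(weight, bias)]
        for tok in tokens
    ]


def cosine(u, v, eps=1e-8):
    """Cosine similarity between two vectors, with eps to avoid division by zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)


def alignment_loss(hidden_tokens, vae_tokens, weight, bias):
    """Assumed alignment objective: 1 minus the mean token-wise cosine similarity
    between projected hidden features and the (frozen) VAE features."""
    projected = project(hidden_tokens, weight, bias)
    sims = [cosine(p, z) for p, z in zip(projected, vae_tokens)]
    return 1.0 - sum(sims) / len(sims)


def total_loss(denoise_loss, align_loss, lam=0.5):
    """Assumed combination: denoising loss plus a weighted alignment term.
    The value of lam is a placeholder, not taken from the paper."""
    return denoise_loss + lam * align_loss


if __name__ == "__main__":
    hidden = [[1.0, 2.0], [3.0, -1.0]]      # toy intermediate features
    vae = [[1.0, 2.0], [3.0, -1.0]]         # toy VAE features for the same tokens
    identity = [[1.0, 0.0], [0.0, 1.0]]     # projection weight (identity for the demo)
    zero_bias = [0.0, 0.0]
    print(alignment_loss(hidden, vae, identity, zero_bias))
```

In this sketch the alignment term approaches zero when the projected features point in the same direction as the VAE features and grows as they diverge. In a real training loop the VAE features would be treated as fixed targets, so only the diffusion transformer and the projection layer would receive gradients.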