Bibliographic Details
Main Authors: Chen, Guanjie; Zhao, Xinyu; Zhou, Yucheng; Qu, Xiaoye; Chen, Tianlong; Cheng, Yu
Format: Preprint
Published: 2024
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2411.17616
author Chen, Guanjie
Zhao, Xinyu
Zhou, Yucheng
Qu, Xiaoye
Chen, Tianlong
Cheng, Yu
contents Diffusion Transformers (DiT) have emerged as a powerful architecture for image and video generation, offering superior quality and scalability. However, their practical application suffers from inherent dynamic feature instability, which amplifies errors during cached inference. Through systematic analysis, we identify the absence of long-range feature preservation mechanisms as the root cause of unstable feature propagation and perturbation sensitivity. To address this, we propose Skip-DiT, a DiT variant for image and video generation enhanced with Long-Skip-Connections (LSCs), the key efficiency component in U-Nets. Theoretical spectral-norm analysis and visualizations demonstrate how LSCs stabilize feature dynamics. The Skip-DiT architecture and its stabilized feature dynamics enable an efficient static caching mechanism that reuses deep features across timesteps while updating shallow components. Extensive experiments on image and video generation tasks demonstrate that Skip-DiT achieves (1) 4.4 times training acceleration and faster convergence, and (2) 1.5-2 times inference acceleration with negligible quality loss and high fidelity to the original output, outperforming existing DiT caching methods across various quantitative metrics. Our findings establish Long-Skip-Connections as critical architectural components for stable and efficient diffusion transformers. Code is available at https://github.com/OpenSparseLLMs/Skip-DiT.
format Preprint
id arxiv_https___arxiv_org_abs_2411_17616
institution arXiv
publishDate 2024
record_format arxiv
title Towards Stabilized and Efficient Diffusion Transformers through Long-Skip-Connections with Spectral Constraints
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2411.17616