Staff View: DraftAttention: Fast Video Diffusion via Low-Resolution Attention Guidance :: Library Catalog

Bibliographic Details
Main Authors:	Shen, Xuan; Han, Chenxia; Zhou, Yufa; Xie, Yanyue; Gong, Yifan; Wang, Quanyi; Wang, Yiwei; Wang, Yanzhi; Zhao, Pu; Gu, Jiuxiang
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2505.14708

_version_	1866913849526976512
author	Shen, Xuan; Han, Chenxia; Zhou, Yufa; Xie, Yanyue; Gong, Yifan; Wang, Quanyi; Wang, Yiwei; Wang, Yanzhi; Zhao, Pu; Gu, Jiuxiang
author_facet	Shen, Xuan; Han, Chenxia; Zhou, Yufa; Xie, Yanyue; Gong, Yifan; Wang, Quanyi; Wang, Yiwei; Wang, Yanzhi; Zhao, Pu; Gu, Jiuxiang
contents	Diffusion transformer-based video generation models (DiTs) have recently attracted widespread attention for their excellent generation quality. However, their computational cost remains a major bottleneck: attention alone accounts for over 80% of total latency, and generating just 8 seconds of 720p video takes tens of minutes, posing serious challenges to practical application and scalability. To address this, we propose DraftAttention, a training-free framework that accelerates video diffusion transformers with dynamic sparse attention on GPUs. We apply down-sampling to each feature map across frames in the compressed latent space, enabling a higher-level receptive field over a latent composed of hundreds of thousands of tokens. The low-resolution draft attention map, derived from the draft query and key, exposes redundancy both spatially within each feature map and temporally across frames. We reorder the query, key, and value based on the draft attention map to guide the sparse attention computation at full resolution, and restore their original order after the attention computation. This reordering enables structured sparsity that aligns with hardware-optimized execution. Our theoretical analysis demonstrates that the low-resolution draft attention closely approximates the full attention, providing reliable guidance for constructing accurate sparse attention. Experimental results show that our method outperforms existing sparse attention approaches in video generation quality and achieves up to 1.75x end-to-end speedup on GPUs. Code: https://github.com/shawnricecake/draft-attention
format	Preprint
id	arxiv_https___arxiv_org_abs_2505_14708
institution	arXiv
publishDate	2025
record_format	arxiv
title	DraftAttention: Fast Video Diffusion via Low-Resolution Attention Guidance
topic	Computer Vision and Pattern Recognition; Artificial Intelligence
url	https://arxiv.org/abs/2505.14708
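
The pipeline summarized in the contents field (down-sample, build a draft attention map, reorder, run sparse attention, restore order) can be sketched roughly as follows. This is a minimal single-head NumPy illustration, not the authors' implementation. Several details are assumptions that the record does not state: average pooling as the down-sampling step, per-row top-k selection of key blocks, and the hypothetical names `block_permutation`, `draft_attention`, `ph`, `pw`, and `keep_ratio`. The sparse step is emulated with a dense mask, so it shows only the selection logic and gives no speedup.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; masked entries (-inf) become exactly 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def block_permutation(F, H, W, ph, pw):
    # Token order that makes the tokens of each (ph x pw) patch contiguous.
    idx = np.arange(F * H * W).reshape(F, H // ph, ph, W // pw, pw)
    return idx.transpose(0, 1, 3, 2, 4).reshape(-1)

def draft_attention(q, k, v, F, H, W, ph=2, pw=2, keep_ratio=0.5):
    # q, k, v: (F*H*W, d) arrays, tokens in frame-major, row-major order.
    n, d = q.shape
    assert n == F * H * W and H % ph == 0 and W % pw == 0

    perm = block_permutation(F, H, W, ph, pw)
    inv = np.argsort(perm)                 # inverse permutation
    bs = ph * pw                           # tokens per patch
    nb = n // bs                           # number of patches (draft tokens)
    qr, kr, vr = q[perm], k[perm], v[perm]

    # Draft query/key: average pooling over each patch (assumed pooling type).
    qd = qr.reshape(nb, bs, d).mean(axis=1)
    kd = kr.reshape(nb, bs, d).mean(axis=1)
    draft = softmax(qd @ kd.T / np.sqrt(d))        # low-resolution attention map

    # Keep the highest-scoring key patches for each query patch (assumed rule).
    keep = max(1, int(round(keep_ratio * nb)))
    top = np.argsort(-draft, axis=1)[:, :keep]
    block_mask = np.zeros((nb, nb), dtype=bool)
    np.put_along_axis(block_mask, top, True, axis=1)

    # Expand the patch-level mask to token level and apply masked attention.
    mask = np.repeat(np.repeat(block_mask, bs, axis=0), bs, axis=1)
    scores = np.where(mask, qr @ kr.T / np.sqrt(d), -np.inf)
    out = softmax(scores) @ vr

    # Restore the original token order.
    return out[inv], block_mask
```

With `keep_ratio=1.0` every block is kept, so the result should match ordinary dense attention; smaller values drop the key patches that the draft map scores lowest. Because the reordered tokens of each patch are contiguous, the kept entries form dense `bs x bs` tiles, which is the kind of structured sparsity a block-sparse GPU kernel could exploit.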
