| Main Authors: | Liu, Hao; Huang, Ye; Huang, Chenghuan; Zheng, Zhenyi; Du, Jiangsu; Ma, Ziyang; Lyu, Jing; Lu, Yutong |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | Computer Vision and Pattern Recognition |
| Online Access: | https://arxiv.org/abs/2604.04451 |
| _version_ | 1866908939677859840 |
|---|---|
| author | Liu, Hao; Huang, Ye; Huang, Chenghuan; Zheng, Zhenyi; Du, Jiangsu; Ma, Ziyang; Lyu, Jing; Lu, Yutong |
| author_facet | Liu, Hao; Huang, Ye; Huang, Chenghuan; Zheng, Zhenyi; Du, Jiangsu; Ma, Ziyang; Lyu, Jing; Lu, Yutong |
| contents | Video Diffusion Transformer (DiT) models are a dominant approach for high-quality video generation but suffer from high inference cost due to iterative denoising. Existing caching approaches primarily exploit similarity within the diffusion process of a single request to skip redundant denoising steps. In this paper, we introduce Chorus, a caching approach that leverages similarity across requests to accelerate video diffusion model serving. Chorus achieves up to 45% speedup on industrial 4-step distilled models, where prior intra-request caching approaches are ineffective. Specifically, Chorus employs a three-stage caching strategy along the denoising process. Stage 1 performs full reuse of latent features from similar requests. Stage 2 exploits inter-request caching in specific latent regions during intermediate denoising steps. This stage is combined with Token-Guided Attention Amplification to improve semantic alignment between the generated video and the conditional prompts, thereby extending the applicability of full reuse to later denoising steps. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2604_04451 |
| institution | arXiv |
| publishDate | 2026 |
| record_format | arxiv |
| title | Beyond Few-Step Inference: Accelerating Video Diffusion Transformer Model Serving with Inter-Request Caching Reuse |
| topic | Computer Vision and Pattern Recognition |
| url | https://arxiv.org/abs/2604.04451 |
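
The abstract's idea of reusing latent features "from similar requests" can be illustrated with a toy lookup. This is only a minimal sketch under stated assumptions, not the Chorus implementation: the record does not say how request similarity is measured, so cosine similarity over prompt embeddings, the `threshold` value, and the names `InterRequestCache`, `put`, and `lookup` are all hypothetical.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class InterRequestCache:
    """Hypothetical store mapping prompt embeddings to cached latents."""

    def __init__(self, threshold: float = 0.9) -> None:
        # Assumed similarity cutoff; the record does not give one.
        self.threshold = threshold
        self._entries: List[Tuple[List[float], object]] = []

    def put(self, embedding: Sequence[float], latent: object) -> None:
        """Remember the latent produced for a finished request."""
        self._entries.append((list(embedding), latent))

    def lookup(self, embedding: Sequence[float]) -> Optional[object]:
        """Return the latent of the most similar cached request, or None
        when no cached request reaches the similarity threshold."""
        best_score, best_latent = -1.0, None
        for cached_embedding, latent in self._entries:
            score = cosine_similarity(embedding, cached_embedding)
            if score > best_score:
                best_score, best_latent = score, latent
        return best_latent if best_score >= self.threshold else None
```

On a hit, a serving system could start from the cached latent instead of computing the early denoising steps; on a miss it would run the full pipeline and call `put` afterwards. A linear scan is used here for clarity; a real deployment would more likely use an approximate nearest-neighbor index.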