Gespeichert in:
Bibliographische Detailangaben
Hauptverfasser: Aitrouga, Abdelilah, Hmamouche, Youssef, Seghrouchni, Amal El Fallah
Format: Preprint
Veröffentlicht: 2025
Schlagworte:
Online-Zugang:https://arxiv.org/abs/2509.25998
Tags: Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
_version_ 1866918230697705472
author Aitrouga, Abdelilah
Hmamouche, Youssef
Seghrouchni, Amal El Fallah
author_facet Aitrouga, Abdelilah
Hmamouche, Youssef
Seghrouchni, Amal El Fallah
contents In light of recent progress in video editing, deep learning models focusing on both spatial and temporal dependencies have emerged as the primary method. However, these models suffer from the quadratic computational complexity of traditional attention mechanisms, making them difficult to adapt to long-duration and high-resolution videos. This limitation restricts their applicability in practical contexts such as real-time video processing. To tackle this challenge, we introduce a method to reduce both time and space complexity of these systems by proposing VRWKV-Editor, a novel video editing model that integrates a linear spatio-temporal aggregation module into video-based diffusion models. VRWKV-Editor leverages bidirectional weighted key-value recurrence mechanism of the RWKV transformer to capture global dependencies while preserving temporal coherence, achieving linear complexity without sacrificing quality. Extensive experiments demonstrate that the proposed method achieves up to 3.7x speedup and 60% lower memory usage compared to state-of-the-art diffusion-based video editing methods, while maintaining competitive performance in frame consistency and text alignment. Furthermore, a comparative analysis we conducted on videos with different sequence lengths confirms that the gap in editing speed between our approach and architectures with self-attention becomes more significant with long videos.
format Preprint
id arxiv_https___arxiv_org_abs_2509_25998
institution arXiv
publishDate 2025
record_format arxiv
spellingShingle VRWKV-Editor: Reducing quadratic complexity in transformer-based video editing
Aitrouga, Abdelilah
Hmamouche, Youssef
Seghrouchni, Amal El Fallah
Computer Vision and Pattern Recognition
Artificial Intelligence
In light of recent progress in video editing, deep learning models focusing on both spatial and temporal dependencies have emerged as the primary method. However, these models suffer from the quadratic computational complexity of traditional attention mechanisms, making them difficult to adapt to long-duration and high-resolution videos. This limitation restricts their applicability in practical contexts such as real-time video processing. To tackle this challenge, we introduce a method to reduce both time and space complexity of these systems by proposing VRWKV-Editor, a novel video editing model that integrates a linear spatio-temporal aggregation module into video-based diffusion models. VRWKV-Editor leverages bidirectional weighted key-value recurrence mechanism of the RWKV transformer to capture global dependencies while preserving temporal coherence, achieving linear complexity without sacrificing quality. Extensive experiments demonstrate that the proposed method achieves up to 3.7x speedup and 60% lower memory usage compared to state-of-the-art diffusion-based video editing methods, while maintaining competitive performance in frame consistency and text alignment. Furthermore, a comparative analysis we conducted on videos with different sequence lengths confirms that the gap in editing speed between our approach and architectures with self-attention becomes more significant with long videos.
title VRWKV-Editor: Reducing quadratic complexity in transformer-based video editing
topic Computer Vision and Pattern Recognition
Artificial Intelligence
url https://arxiv.org/abs/2509.25998